• Hackers Realm

Loan Prediction Analysis using Python | Classification | Machine Learning Project Tutorial

Updated: Apr 9

Loan Prediction Analysis is a classification problem in which we need to classify whether the loan will be approved or not. Classification signifies a predictive modelling problem where a class label is predicted for a given example of input data. A few examples of classification problems are Credit Card Fraud Detection, Iris Dataset Analysis etc. A Loan Prediction Classification model is used to evaluate the loan status and build strategies.

In this project tutorial, we will learn about Loan Prediction and its analysis in Python. It is a classification problem whose objective is to predict whether a loan will be approved or not.

Dataset Information

Dream Housing Finance company deals in all home loans. They have a presence across all urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility for the loan. The company wants to automate the loan eligibility process (in real time) based on the customer details provided while filling out the online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have posed the problem of identifying the customer segments that are eligible for a loan, so that they can specifically target these customers.

This is a standard supervised classification task in which we have to predict whether a loan will be approved or not. Below are the dataset attributes with a description.

Download the Dataset here

Import Modules

First, we have to import all the basic modules we will be needing for this project.

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
  • pandas - used to perform data manipulation and analysis

  • numpy - used to perform a wide variety of mathematical operations on arrays

  • matplotlib - used for data visualization and graphical plotting

  • seaborn - built on top of matplotlib with similar functionalities

  • %matplotlib inline - enables inline plotting in a Jupyter notebook

  • warnings - used to control warning messages; filterwarnings('ignore') suppresses the warnings thrown by the modules (gives clean results)

Loading the Dataset

df = pd.read_csv("Loan Prediction Dataset.csv")
  • We have to predict the output variable "Loan status".

  • The Input attributes are in categorical as well as in numerical form.

  • We have to analyze all the attributes.

Statistics Data Information

  • The count row is lower for some columns, which indicates missing values that we will deal with later.

  • The credit history values are in the range of 0 to 1.

  • We can observe 13 attributes, of which 4 are floats, 1 is an integer and the other 8 are objects.

  • We could convert the object columns to more specific data types to reduce memory usage.

  • However, the memory usage is only about 62 KB, so we don't need to change any of the data types.
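
The observations above come from the summary statistics and the data frame info. A minimal sketch of both calls, assuming a small hypothetical data frame (`sample`) in place of the real loan data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loan data: a few rows with similar kinds of columns
sample = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

# describe() summarizes the numerical columns; the "count" row excludes missing values
stats = sample.describe()
print(stats)

# info() lists each column's data type, non-null count and the total memory usage
sample.info()
```

A "count" value lower than the number of rows is the quickest hint that a numerical column has missing entries.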

Preprocessing the Loan Sanction Data

Let us check for NULL values in the dataset.

# find the null values
df.isnull().sum()
  • We have found 7 columns having NULL values.

  • Now, we have to replace the NULL values with some common values.

Let us fill in the missing values for numerical terms using mean.

# fill the missing values for numerical terms - mean
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mean())
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mean())
  • All the missing values will be filled with the mean of the current column.

Let us now fill in the missing values for categorical terms using mode operation.

# fill the missing values for categorical terms - mode
df['Gender'] = df["Gender"].fillna(df['Gender'].mode()[0])
df['Married'] = df["Married"].fillna(df['Married'].mode()[0])
df['Dependents'] = df["Dependents"].fillna(df['Dependents'].mode()[0])
df['Self_Employed'] = df["Self_Employed"].fillna(df['Self_Employed'].mode()[0])
  • All the missing values will be filled with the most frequently occurring values.

  • mode() returns its result as a Series (there can be more than one mode), so we select index 0 to get a single value.
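
The two imputation strategies can be tried on a tiny example. This is a sketch on a hypothetical data frame (`toy`), not the actual dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing numerical and one missing categorical value
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 200.0],
    "Gender": ["Male", None, "Male"],
})

# count the missing values per column before imputation
print(toy.isnull().sum())

# numerical column: fill with the mean of the known values
toy["LoanAmount"] = toy["LoanAmount"].fillna(toy["LoanAmount"].mean())

# categorical column: mode() returns a Series, so take its first element
gender_modes = toy["Gender"].mode()
toy["Gender"] = toy["Gender"].fillna(gender_modes[0])

# after imputation no missing values remain
print(toy.isnull().sum())
```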

Now, let's check for the NULL values again.

  • All the NULL values are now replaced.

Exploratory Data Analysis

Let us first explore the categorical column "Gender".

# categorical attributes visualization
  • The majority of the applicants are male and only a small share are female.

  • From these analyses, we will get an intuition that will be useful in building the model.
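
A count plot is one common way to visualize a categorical column like this. The sketch below assumes a hypothetical toy column and uses seaborn's countplot; the Agg backend line is only there so that it also runs outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not the real applicant records
toy = pd.DataFrame({"Gender": ["Male", "Male", "Female", "Male", "Male", "Female"]})

# the numbers behind the plot: how often each category occurs
counts = toy["Gender"].value_counts()
print(counts)

# one bar per category, with bar height equal to the count
ax = sns.countplot(data=toy, x="Gender")
ax.set_title("Applicants by gender (toy data)")
plt.close("all")
```

The same call with a different column name would produce the plots for the other categorical columns.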

To display the column "Married".

  • The majority of the applicants are married.

To display the column "Dependents".

  • The majority of the applicants have zero dependents, around 100 applicants each have one or two dependents and only a few have three or more dependents.

To display the column "Education".


To display the column "Self Employed".

  • Around 90 applicants are either freelancers or run a business.

To display the column "Property Area".

  • We can assume that the applicants are equally distributed in urban, rural and semi-urban areas.

To display the column "Loan Status".

  • Around 400 loans are accepted and 200 loans are rejected. This shows a roughly 2:1 ratio.
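
The class ratio is worth quantifying, because it sets a baseline: a model that always predicts the majority class already reaches that share as accuracy. A sketch with hypothetical labels in the 2:1 ratio described above (the exact counts in the real data differ slightly):

```python
import pandas as pd

# Hypothetical labels with the roughly 2:1 ratio of accepted to rejected loans
loan_status = pd.Series(["Y"] * 400 + ["N"] * 200, name="Loan_Status")

# absolute counts and relative shares per class
counts = loan_status.value_counts()
shares = loan_status.value_counts(normalize=True)
print(counts)

# accuracy of always predicting the majority class
baseline_accuracy = shares.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained model should be judged against this baseline and not against 50%.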

Let us now explore the numerical column "Applicant Income".

# numerical attributes visualization
  • The data are right-skewed: most values are concentrated on the left of the graph with a long tail to the right, which is not a suitable distribution to train a model.

  • Hence, we will apply a log transformation later to bring the attribute closer to a bell curve (normal distribution).

To display the column "Co-applicant Income".

  • We have to normalize this graph as well.

To display the column "Loan Amount".

  • The loan amount graph is slightly right-skewed. We will consider this for Normalization.

To display the column "Loan Amount Term".

  • Most of the entries share a single dominant value, which is also among the highest values. We will apply a log transformation to this attribute as well.

To display the column "Credit History".

  • Since the values of credit history are in the range of 0 to 1, we don't need to normalize this graph.

Creation of new attributes

We can create new attributes by applying a log transformation. We can also create a new attribute, Total Income, which is the sum of Applicant Income and Co-applicant Income.

# total income
df['Total_Income'] = df['ApplicantIncome'] + df['CoapplicantIncome']

Log Transformation

Log transformation helps to make a highly skewed distribution less skewed. Instead of overwriting the original column, we will store the result in a new column whose name ends with 'Log'.

To display the column "Applicant Income Log".

# apply log transformation to the attribute
df['ApplicantIncomeLog'] = np.log(df['ApplicantIncome']+1)
  • We can observe a Normal distribution in a form of a Bell Curve.
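
The effect of the transformation can also be checked numerically with the skewness. The sketch below uses synthetic, income-like values (an assumption, not the real column) and np.log1p, which computes the same log(x + 1) as the code above:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "income" values; a stand-in for the real column
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=8.5, sigma=0.6, size=1000))

# positive skewness means a long tail to the right
skew_before = income.skew()

# log1p(x) equals log(x + 1), so a value of 0 stays defined
income_log = np.log1p(income)
skew_after = income_log.skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to 0 after the transformation indicates a roughly symmetric, bell-shaped distribution.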

To display the column "Co-applicant Income Log".

df['CoapplicantIncomeLog'] = np.log(df['CoapplicantIncome']+1)

To display the column "Loan Amount Log".

df['LoanAmountLog'] = np.log(df['LoanAmount']+1)

To display the column "Loan Amount Term Log".

df['Loan_Amount_Term_Log'] = np.log(df['Loan_Amount_Term']+1)
  • The Loan Amount Term distribution is slightly better than before, although it is still skewed.

To display the column "Total Income Log".

df['Total_Income_Log'] = np.log(df['Total_Income']+1)
  • We can observe a roughly normal distribution for the log of the newly created column 'Total Income'.

After normalizing all the data in the dataset, let's check the correlation matrix.

Correlation Matrix

For this project, the correlation matrix shows how strongly the numerical attributes are correlated with each other.

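# correlation is only defined for numerical columns, so select those first
corr = df.select_dtypes(include='number').corr()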
sns.heatmap(corr, annot = True, cmap="BuPu")
  • In this heatmap, stronger correlations are shown in darker colors and weaker correlations in lighter colors.

  • The original attributes are highly correlated with their log-transformed versions.

  • Highly correlated attributes carry redundant information, so we need to remove one attribute of each such pair.

  • We will remove the original attributes and keep the log attributes to train our model.
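
Highly correlated pairs can also be listed programmatically. This is a sketch on synthetic data; the 0.8 threshold and the toy columns are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Synthetic data: an original column, its log version and an unrelated column
rng = np.random.default_rng(1)
amount = pd.Series(rng.lognormal(mean=5.0, sigma=0.5, size=500))
toy = pd.DataFrame({
    "LoanAmount": amount,
    "LoanAmountLog": np.log1p(amount),
    "Credit_History": rng.integers(0, 2, size=500).astype(float),
})

corr = toy.corr()
threshold = 0.8  # hypothetical cut-off for "highly correlated"

# keep only the upper triangle so that each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high = pairs[pairs.abs() > threshold]
print(high)
```

Here only the original column and its log version exceed the threshold, which mirrors the reason for dropping the original attributes.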

To check the values of the dataset.


Let us drop some unnecessary columns.

# drop unnecessary columns
cols = ['ApplicantIncome', 'CoapplicantIncome', "LoanAmount", "Loan_Amount_Term", "Total_Income", 'Loan_ID', 'CoapplicantIncomeLog']
df = df.drop(columns=cols)
  • Of the original numerical columns, we keep only 'Credit_History'.

Label Encoding

We will use label encoding to convert the categorical columns into numerical columns.

from sklearn.preprocessing import LabelEncoder
cols = ['Gender', "Married", "Education", 'Self_Employed', "Property_Area", "Loan_Status", "Dependents"]
le = LabelEncoder()
for col in cols:
    df[col] = le.fit_transform(df[col])
  • We loop over each column in the list; for each one, 'le.fit_transform()' converts the values into numbers, which are then stored back into the same column.

  • All the values of the dataset are now in numerical format. It will help us to train our model easily.

  • For Loan status, 1 indicates 'Yes' and 0 indicates 'No'.
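
LabelEncoder assigns the codes in sorted order of the labels, which is why 'N' becomes 0 and 'Y' becomes 1. A minimal sketch on a hypothetical status column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical loan status values
status = pd.Series(["Y", "N", "Y", "Y"])

le = LabelEncoder()
encoded = le.fit_transform(status)

# classes_ holds the sorted labels; the position of a label is its code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
print(list(encoded))
```

Because the loop above refits the same encoder for every column, only the mapping of the last column stays available in `le`; a separate encoder per column would be needed to decode the others later.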

Splitting the data for Training and Testing

Before training and testing, we have to specify the input and output attributes.

# specify input and output attributes
X = df.drop(columns=['Loan_Status'], axis=1)
y = df['Loan_Status']

Let us now split the data.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
  • We set random_state=42 to get the same split every time the code is re-run.

  • If you don't specify random_state, the data is split differently on every run, which gives inconsistent results.
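
The reproducibility can be verified directly. A sketch with small hypothetical arrays (`X_demo`, `y_demo`) in place of the loan data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples with 2 features each
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# two splits with the same random_state
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# identical seeds give identical splits; test_size=0.25 keeps 5 of 20 rows for testing
same_split = np.array_equal(a_test, b_test)
print(same_split, a_train.shape, a_test.shape)
```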

Model Training

# classify function
from sklearn.model_selection import cross_val_score
def classify(model, x, y):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
    model.fit(x_train, y_train)
    print("Accuracy is", model.score(x_test, y_test)*100)
    # cross validation - gives a more reliable estimate than a single split
    # e.g. cv=5: each round trains on 4 parts and tests on the remaining 1
    score = cross_val_score(model, x, y, cv=5)
    print("Cross validation is",np.mean(score)*100)
  • Here, cross-validation will split the data set into multiple parts.

  • For example, cv=5 means the data is split into 5 parts.

  • In each iteration, 4 parts are used for training and the remaining part is used for testing.

  • Common choices for cv are 3 or 5.
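
The splitting can be made visible by iterating over the folds. For a classifier and an integer cv, cross_val_score uses stratified folds, so this sketch uses StratifiedKFold on hypothetical data with a 2:1 class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical data: 60 samples, classes in a 2:1 ratio
X_demo = np.zeros((60, 3))
y_demo = np.array([1] * 40 + [0] * 20)

skf = StratifiedKFold(n_splits=5)

# each iteration yields the row indices used for training and for testing
fold_sizes = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)
```

Every sample is used for testing exactly once, and each fold keeps the class ratio of the full data.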

Logistic Regression:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
classify(model, X, y)
  • Since cross-validation averages the accuracy over multiple splits, we should focus on the cross-validation score, which is a more reliable estimate of the model's overall accuracy than a single train/test split.

Decision Tree:

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
classify(model, X, y)
  • The decision tree does not show good results.

Random Forest:

from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier
model = RandomForestClassifier()
classify(model, X, y)
  • Random forest shows better results than a Decision tree.

Extra Trees:

model = ExtraTreesClassifier()
classify(model, X, y)
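
To compare the four models side by side, the cross-validation scores can be collected in one loop. This is a sketch on synthetic data from make_classification (an assumption; the scores say nothing about the loan dataset), with max_iter raised for Logistic Regression to help it converge:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared loan features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Extra Trees": ExtraTreesClassifier(random_state=42),
}

# mean 5-fold cross-validation accuracy (in percent) for each model
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[name] = scores.mean() * 100

for name, score in results.items():
    print(f"{name}: {score:.2f}")
```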