Hackers Realm

Loan Prediction Analysis using Python | Classification | Machine Learning Project Tutorial

Updated: Apr 15

Loan Prediction Analysis is a classification problem in which we need to classify whether the loan will be approved or not. Classification signifies a predictive modelling problem where a class label is predicted for a given example of input data. A few examples of classification problems are Credit Card Fraud Detection, Iris Dataset Analysis etc. A Loan Prediction Classification model is used to evaluate the loan status and build strategies.


In this project tutorial, we will learn about loan prediction and its analysis in Python. It is a classification problem whose objective is to predict whether a loan will be approved or not.



You can watch the video-based tutorial with step by step explanation down below.


Dataset Information


Dream Housing Finance company deals in all home loans. They have a presence across all urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility for the loan. The company wants to automate the loan eligibility process (in real time) based on the customer details provided while filling out the online application form.


These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have posed the problem of identifying the customer segments that are eligible for a loan amount, so that they can specifically target these customers.

This is a standard supervised classification task in which we have to predict whether a loan will be approved or not. Below are the dataset attributes with descriptions.

Download the Dataset here



Import Modules


First, we have to import all the basic modules we will be needing for this project.

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
  • pandas - used to perform data manipulation and analysis

  • numpy - used to perform a wide variety of mathematical operations on arrays

  • matplotlib - used for data visualization and graphical plotting

  • seaborn - built on top of matplotlib with similar functionalities

  • %matplotlib inline - enables inline plotting in the notebook

  • warnings - controls how warnings are handled; filterwarnings('ignore') suppresses the warnings raised by the modules, which gives cleaner output



Loading the Dataset


df = pd.read_csv("Loan Prediction Dataset.csv")
df.head()
  • We have to predict the output variable "Loan status".

  • The Input attributes are in categorical as well as in numerical form.

  • We have to analyze all the attributes.
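As a rough sketch of how the categorical and numerical attributes could be separated before analysis, the snippet below uses a small made-up frame. The `toy` data and the variable names `categorical_cols` and `numerical_cols` are assumptions for illustration, not part of the original tutorial; the column names mirror the loan dataset.

```python
import pandas as pd

# Toy stand-in for the loan data; the values are made up for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "ApplicantIncome": [5849, 4583, 3000],
    "LoanAmount": [128.0, 66.0, 120.0],
    "Loan_Status": ["Y", "N", "Y"],
})

# Numeric dtypes are treated as numerical attributes,
# everything else (here: strings) as categorical attributes.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("categorical:", categorical_cols)
print("numerical:", numerical_cols)
```

The same two calls should work on the real `df` once it is loaded.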



Statistics Data Information


df.describe()
  • The count row shows fewer entries for some columns, which indicates missing values; we will deal with them later.

  • The Credit_History attribute takes values in the range of 0 to 1.



df.info()
  • We can observe 13 attributes, of which 4 are floats, 1 is an integer and the other 8 are objects.

  • We could convert the object columns to more specific data types to reduce memory usage.

  • However, the data frame uses only about 62 KB of memory, so we don't need to change any of the data types.
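For larger datasets where memory does matter, one common option is the pandas `category` dtype. The sketch below is only an illustration on made-up data (the `toy` frame is an assumption); it compares the memory used by a repeated string column before and after conversion.

```python
import pandas as pd

# Toy column with few distinct values repeated many times (made-up data).
toy = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban"] * 1000})

# deep=True also counts the memory held by the string values themselves.
before = toy["Property_Area"].memory_usage(deep=True)

# A categorical column stores each distinct label once plus small integer codes.
toy["Property_Area"] = toy["Property_Area"].astype("category")
after = toy["Property_Area"].memory_usage(deep=True)

print(f"as strings: {before} bytes, as category: {after} bytes")
```

For a dataset as small as this one, the saving is negligible, which is why the tutorial leaves the data types unchanged.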



Preprocessing the Loan Sanction Data


Let us check for NULL values in the dataset.

# find the null values
df.isnull().sum()
  • We have found 7 columns with NULL values.

  • Now, we have to replace the NULL values with some common values.
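To see at a glance which columns need attention, the null counts can be filtered and turned into percentages. This is a minimal sketch on made-up data; the `toy` frame and the names `missing` and `missing_pct` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two of its three columns (made-up values).
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, np.nan, 120.0],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
})

null_counts = toy.isnull().sum()

# Keep only the columns that actually contain missing values.
missing = null_counts[null_counts > 0]

# Express the gaps as a percentage of all rows.
missing_pct = (missing / len(toy) * 100).round(1)

print(pd.DataFrame({"missing": missing, "percent": missing_pct}))
```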



Let us fill in the missing values for numerical terms using mean.

# fill the missing values for numerical terms - mean
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mean())
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mean())
  • All the missing values will be filled with the mean of the current column.


Let us now fill in the missing values for categorical terms using mode operation.

# fill the missing values for categorical terms - mode
df['Gender'] = df["Gender"].fillna(df['Gender'].mode()[0])
df['Married'] = df["Married"].fillna(df['Married'].mode()[0])
df['Dependents'] = df["Dependents"].fillna(df['Dependents'].mode()[0])
df['Self_Employed'] = df["Self_Employed"].fillna(df['Self_Employed'].mode()[0])
  • All the missing values will be filled with the most frequently occurring values.

  • mode() returns a pandas Series rather than a single value, because a column can have more than one mode. We index it with [0] to take the first value.
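The two imputation rules above (mean for numerical columns, mode for categorical ones) can also be written as a single loop. The sketch below runs on a small made-up frame; the `toy` data is an assumption, and the dtype check is one possible way to choose the rule, not necessarily how the original project does it.

```python
import numpy as np
import pandas as pd

# Toy frame with one numerical and one categorical gap (made-up values).
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 200.0],
    "Gender": ["Male", None, "Male"],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Numerical column: fill with the mean of the observed values.
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # Categorical column: fill with the most frequent value.
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Both `mean()` and `mode()` skip missing values by default, so the fill values are computed from the observed entries only.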



Now, let's check for the NULL values again.

df.isnull().sum()
  • All the NULL values are now replaced.



Exploratory Data Analysis


Let us first explore the categorical column "Gender".

# categorical attributes visualization
sns.countplot(x=df['Gender'])
  • The majority of the applicants are male and a smaller share are female.

  • From these analyses, we will get an intuition that will be useful in building the model.



To display the column "Married".

sns.countplot(x=df['Married'])
  • The majority of the applicants are married.



To display the column "Dependents".

sns.countplot(x=df['Dependents'])
  • The majority of the applicants have zero dependents, around 100 applicants have one or two dependents and only a few have three or more dependents.



To display the column "Education".

sns.countplot(x=df['Education'])


To display the column "Self Employed".

sns.countplot(x=df['Self_Employed'])
  • Around 90 applicants are either freelancers or run a business.


To display the column "Property Area".

sns.countplot(x=df['Property_Area'])
  • The applicants are fairly evenly distributed across urban, rural and semi-urban areas.



To display the column "Loan Status".

sns.countplot(x=df['Loan_Status'])
  • Around 400 loans are approved and around 200 are rejected, which is roughly a 2:1 ratio.
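The class balance read off the plot can also be checked numerically with `value_counts`. The labels below are made up to mirror a 2:1 split and are only an assumption for illustration; on the real data the same calls would be applied to `df['Loan_Status']`.

```python
import pandas as pd

# Made-up labels chosen to mirror a 2:1 split between approved and rejected.
loan_status = pd.Series(["Y"] * 4 + ["N"] * 2, name="Loan_Status")

counts = loan_status.value_counts()                 # absolute counts per class
shares = loan_status.value_counts(normalize=True)   # fraction of rows per class
ratio = counts["Y"] / counts["N"]                   # approved-to-rejected ratio

print(counts)
print(shares.round(2))
print("ratio:", ratio)
```

An imbalance of this size is worth keeping in mind later, since plain accuracy can look better than it is when one class dominates.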



Let us now explore the numerical column "Applicant Income".

# numerical attributes visualization
sns.distplot(df["ApplicantIncome"])
  • The distribution is right-skewed: most values cluster on the left of the graph with a long tail to the right, which is not a suitable distribution to train a model.

  • Hence, we will apply a log transformation later to bring the attribute closer to a bell curve (normal distribution).
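To illustrate what a log transformation does to a skewed attribute, the sketch below generates a synthetic income-like column and compares its skewness before and after. The data is randomly generated, not taken from the loan dataset, and the use of `np.log1p` (log of 1 + x, which also handles zero values) is an assumption; the tutorial does not specify the exact form of the transformation at this point.

```python
import numpy as np
import pandas as pd

# Synthetic, income-like values with a long right tail (not the real data).
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=8.0, sigma=0.8, size=1000),
                   name="ApplicantIncome")

# log1p computes log(1 + x), so zero values would not cause problems.
income_log = np.log1p(income)

# Skewness near 0 indicates a roughly symmetric distribution.
skew_before = income.skew()
skew_after = income_log.skew()

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to zero after the transformation suggests the values are now distributed much more symmetrically around their mean.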



To display the column "Co-applicant Income".

sns.distplot(df["CoapplicantIncome"])
  • We have to normalize this distribution as well.



To display the column "Loan Amount".

sns.distplot(df["LoanAmount"])