Boston House Price Prediction Analysis using Python | Regression

Dive into the world of Boston house price prediction using Python! This comprehensive blog tutorial explores regression techniques and machine learning algorithms. Learn data preprocessing, feature engineering, and model evaluation. Gain hands-on experience with regression algorithms like linear regression, decision trees, and random forests. Enhance your Python programming, data analysis, and machine learning skills through this step-by-step project tutorial. Join us on this exciting journey of Boston house price prediction!

Boston Housing Price Prediction - Regression

In this project tutorial, we are learning about Boston house price prediction analysis with the help of machine learning. The objective of this problem is to predict the monetary value of a house located in the Boston suburbs.


Dataset Information

The Boston House Prices dataset was collected in 1978 and has 506 entries with 14 attributes (features) for homes from various suburbs of Boston.

Attribute Information:

- CRIM per capita crime rate by town

- ZN proportion of residential land zoned for lots over 25,000 sq.ft.

- INDUS proportion of non-retail business acres per town

- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

- NOX nitric oxides concentration (parts per 10 million)

- RM average number of rooms per dwelling

- AGE proportion of owner-occupied units built prior to 1940

- DIS weighted distances to five Boston employment centers

- RAD index of accessibility to radial highways

- TAX full-value property-tax rate per $10,000

- PTRATIO pupil-teacher ratio by town

- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

- LSTAT % lower status of the population

- MEDV Median value of owner-occupied homes in $1000's

MEDV is the price we have to predict. It is expressed in units of $1000, so a value of 1 corresponds to $1,000.
We will predict the target variable from the given 13 input attributes.
We can ignore some attributes if they are not beneficial for predicting the output variable.
We can also create some new features from the available attributes.

Download the Dataset here

Import Modules

First, let us import all the basic modules we will be needing for this project.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')

pandas - used to perform data manipulation and analysis
numpy - used to perform a wide variety of mathematical operations on arrays
matplotlib - used for data visualization and graphical plotting
seaborn - built on top of matplotlib with similar functionalities
%matplotlib inline - enables inline plotting in the notebook
warnings - used to control how warnings are displayed
filterwarnings('ignore') suppresses the warnings thrown by the modules (gives clean results)

Loading the Dataset

df = pd.read_csv("Boston Dataset.csv")
# drop the unneeded 'Unnamed: 0' column
df.drop(columns=['Unnamed: 0'], inplace=True)
df.head()

We have dropped the unnecessary column 'Unnamed: 0'.

Statistical Information.

# statistical info
df.describe()

The count is 506 for every column, so there are no NULL values.
The ranges of the remaining statistics look reasonable.

Datatype Information.

# datatype info
df.info()

All the columns have a numerical datatype.
We can create new categorical columns from the existing columns if needed.
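
As a hedged illustration of deriving a categorical column from a numerical one, the sketch below bins a room-count column with pd.cut and one-hot encodes the result. The toy values, the bin edges, and the column name rm_band are assumptions made for this example; they are not part of the original project.

```python
import pandas as pd

# Toy stand-in for the 'rm' column (average rooms per dwelling); values are made up.
toy = pd.DataFrame({"rm": [4.2, 5.9, 6.3, 7.1, 8.4]})

# Hypothetical bin edges and labels, chosen only for illustration.
toy["rm_band"] = pd.cut(
    toy["rm"],
    bins=[0, 5, 7, float("inf")],
    labels=["small", "medium", "large"],
)

# One-hot encode the new categorical column so that models can consume it.
encoded = pd.get_dummies(toy, columns=["rm_band"])
print(toy)
print(encoded.columns.tolist())
```

Whether such a feature helps depends on the model, so it is worth comparing the cross-validation score with and without it.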

Preprocessing the dataset

# check for null values
df.isnull().sum()

No NULL values were found.

Exploratory Data Analysis

Let us create box plots for all columns to identify the outliers.

# create box plots
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()

for col, value in df.items():
    sns.boxplot(y=col, data=df, ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

We use a for loop to draw one subplot per column.

In the plots, the dots represent the outliers.
Columns with many outliers do not follow a normal distribution.
We can reduce the influence of outliers with a log transformation.
We can also drop a column that contains many outliers, or delete the rows that contain them.
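
The effect of a log transformation on outliers can be sketched as follows. This is a minimal example on made-up values, not on the Boston columns. The helper count_iqr_outliers is a hypothetical function that applies the 1.5 x IQR rule that box plots use by default for their whiskers.

```python
import numpy as np

def count_iqr_outliers(values):
    """Count points that fall outside the 1.5 * IQR whiskers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((values < lower) | (values > upper)))

# Made-up, right-skewed values standing in for a column such as 'crim'.
skewed = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 12.0, 16.0, 40.0, 80.0])

# log1p computes log(1 + x), which stays defined when a value is exactly 0.
logged = np.log1p(skewed)

print("outliers before log1p:", count_iqr_outliers(skewed))
print("outliers after log1p: ", count_iqr_outliers(logged))
```

The log compresses large values much more than small ones, which is why it pulls a long right tail back towards the bulk of the data.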

Let us create distribution plots for all columns.

# create distribution plots
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()

for col, value in df.items():
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(value, kde=True, stat='density', ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

We can observe right-skewed and left-skewed distributions for 'crim', 'zn', 'tax', and 'black'.
Therefore, we need to normalize these columns.

Min-Max Normalization

We will create the column list for the 4 columns and use Min-Max Normalization.

cols = ['crim', 'zn', 'tax', 'black']
for col in cols:
    # find minimum and maximum of that column
    minimum = df[col].min()
    maximum = df[col].max()
    # rescale the column to the range [0, 1]
    df[col] = (df[col] - minimum) / (maximum - minimum)

The last line of the loop applies the min-max formula, (x - min) / (max - min).
The loop runs once for each of the 4 selected columns.
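
For comparison, the same formula is available in scikit-learn as MinMaxScaler. The sketch below uses a made-up two-column frame (not the real dataset) to check that the manual formula and the scaler agree.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values standing in for two of the skewed columns.
toy = pd.DataFrame({
    "crim": [0.5, 2.0, 8.0, 20.0],
    "tax": [200.0, 300.0, 450.0, 700.0],
})

# Manual min-max normalization: (x - min) / (max - min), column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation via scikit-learn.
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

print(manual.round(3))
print("Same result:", np.allclose(manual.values, scaled.values))
```

A practical advantage of the scaler object is that it remembers the minimum and maximum it was fitted on, so the same transformation can be applied to new data with scaler.transform.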

fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()

for col, value in df.items():
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(value, kde=True, stat='density', ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

Distribution Plot of Attributes after Min Max Normalization

Now the values of these columns lie between 0 and 1.
Min-Max Normalization maps the maximum value of each column to 1 and the minimum value to 0.

Standardization For Attributes

Standardization rescales each column using its mean and standard deviation: z = (x - mean) / std. Here, preprocessing.StandardScaler() is the standardization class.

# standardization
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()

# fit the scaler on the selected columns and transform them
scaled_cols = scaler.fit_transform(df[cols])
scaled_cols = pd.DataFrame(scaled_cols, columns=cols)
scaled_cols.head()

The output above shows the standardized values.
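
To make the standardization formula concrete, here is a minimal sketch on made-up numbers. It computes z = (x - mean) / std by hand and compares the result with StandardScaler, which uses the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for one numeric column (shape: n_samples x 1).
x = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

# z = (x - mean) / std, with the population standard deviation (NumPy's default, ddof=0).
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The same transformation via scikit-learn.
scaled = StandardScaler().fit_transform(x)

print("Same result:", np.allclose(manual, scaled))
print("Mean after scaling:", scaled.mean(axis=0))
print("Std after scaling: ", scaled.std(axis=0))
```

After standardization each column has a mean of 0 and a standard deviation of 1, up to floating-point rounding.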

Let us write these values back to our original data frame.

for col in cols:
    df[col] = scaled_cols[col]

This code assigns the standardized values back to the original data frame.

Let us display the standardized values in subplots.

fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()

for col, value in df.items():
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(value, kde=True, stat='density', ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

Distribution Plot of Attributes after Standard Scaling

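Even now the columns 'crim', 'zn', 'tax', and 'black' do not show a perfect normal distribution. This is expected: min-max normalization and standardization are both linear rescalings, so they change the range of a column but not the shape of its distribution.
However, putting these columns on a comparable scale may slightly improve the performance of scale-sensitive models.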
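
Scaling changes the range of a column, not its shape. The sketch below checks this on made-up, right-skewed values: skewness is the same before and after min-max normalization and standardization, while a non-linear transform such as log1p (shown only for contrast) reduces it.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values standing in for a column such as 'crim'.
raw = pd.Series([1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 12.0, 16.0, 40.0, 80.0])

min_max = (raw - raw.min()) / (raw.max() - raw.min())
standardized = (raw - raw.mean()) / raw.std(ddof=0)
logged = np.log1p(raw)  # non-linear transform, for contrast

for name, series in [("raw", raw), ("min-max", min_max),
                     ("standardized", standardized), ("log1p", logged)]:
    print(f"{name:>12}: skew = {series.skew():.3f}")
```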

Over-fitting vs Under-fitting

We will now discuss crucial differences between Over-fitting and Under-fitting with the help of three examples. Each graph contains two classes 'X' and 'O'.

For Under-Fitting: We have a straight line representing the under-fitted model. It implies the model is too simple or not well trained, for example because the training data is limited. There are many misclassifications between X and O.
For Appropriate-Fitting: We have a smooth non-linear curve representing a well-fitted model. It means the model is well trained. There are only a few misclassifications.
For Over-Fitting: We have a highly complex curve that classifies every training point correctly. It indicates that the model is overtrained, often because it has too many features, and has fitted the noise in the training data.

The appropriately fitted model is a generalized model that performs well on both training and test data.

The graph below contains examples of bias and variance.

High-bias (Under-Fit): The model uses few features, so the regression produces a simple straight line.
Low-bias, low-variance (Good-Fit): The model uses sufficient features, so it produces a smooth curve that follows the underlying pattern.
High-variance (Over-Fit): The model uses a large number of features, so it captures the noise as well as the pattern and produces an overly complex curve.

In a nutshell, over-fitting shows good performance on the training data but poor generalization to test data, whereas under-fitting shows poor performance on both the training data and the test data.

Depending on the number of features, the model can be under-fitted or over-fitted.
Always aim for a good-fit model.
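
The idea can be sketched numerically with polynomial regression, where the degree plays the role of the number of features. The data below is synthetic (a sine curve plus noise), and the degrees 1, 4, and 12 are arbitrary choices for illustration. Degree 1 is expected to under-fit, while degree 12 fits the training points most closely and would typically do worse on the held-out points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Synthetic data: a smooth curve plus noise (an assumption for illustration only).
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=x.size)

# Alternate points between the training set and the test set.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return float(np.mean((y_true - y_pred) ** 2))

results = {}
for degree in (1, 4, 12):
    # A higher degree means more polynomial features and a more flexible model.
    poly = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(y_train, poly(x_train)), mse(y_test, poly(x_test)))

for degree, (train_err, test_err) in results.items():
    print(f"degree {degree:>2}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```

Comparing the training and test error side by side is the practical way to tell the three cases apart: both high suggests under-fitting, and a low training error with a much higher test error suggests over-fitting.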

Correlation Matrix

corr = df.corr()
plt.figure(figsize=(20,10))
sns.heatmap(corr, annot=True, cmap='coolwarm')

Since this is a regression problem, we mostly focus on how each attribute correlates with the target variable 'medv'.
We can also see that the input attributes 'tax' and 'rad' are highly correlated with each other.
We will later remove this redundancy by dropping one of the two.
Additionally, we will plot 'lstat' and 'rm' to show their correlation with the target variable 'medv'.
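
Reading highly correlated pairs off a large heatmap is error-prone, so here is a small sketch that lists them programmatically. The helper correlated_pairs and the 0.9 threshold are hypothetical choices for this example, and the toy frame only reuses the column names 'rad', 'tax', and 'rm' with made-up values.

```python
import pandas as pd

# Toy frame: 'tax' is an exact linear function of 'rad'; 'rm' is unrelated. Values are made up.
toy = pd.DataFrame({
    "rad": [1, 2, 3, 4, 5, 6],
    "tax": [3, 5, 7, 9, 11, 13],
    "rm": [3, 1, 4, 1, 5, 9],
})

def correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for every column pair with |r| above the threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

print(correlated_pairs(toy))
```

On the real data, the same function could be called as correlated_pairs(df.drop(columns=['medv'])) to look only at the input attributes.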

Relation between Target and Correlated Variables

sns.regplot(y=df['medv'], x=df['lstat'])

Here, the price of houses decreases as 'lstat' increases. Hence, the two are negatively correlated.

sns.regplot(y=df['medv'], x=df['rm'])

Here, the price of houses increases as 'rm' increases. Hence, the two are positively correlated.

Input Split

Let us separate the input attributes from the target variable. We also drop 'rad', which is highly correlated with 'tax'.

X = df.drop(columns=['medv', 'rad'])
y = df['medv']

Model Training

Now let's import functions to train models.

Instead of training on the whole dataset, we hold out part of it to estimate the model performance.
If you train and test on the same data, the results will be overly optimistic. Hence, we will use 'train_test_split'.
We set random_state to 42 to get the same split upon re-running.
If you don't specify a random state, the data is split differently on each run, giving inconsistent results.
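
The effect of random_state can be checked with a short sketch on made-up data. With the same seed, two calls to train_test_split return the same held-out samples. By default, 25% of the samples are held out for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with a single feature.
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10)

# Same random_state, so both calls produce an identical split.
_, X_test_a, _, y_test_a = train_test_split(X_demo, y_demo, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X_demo, y_demo, random_state=42)

print("Identical test sets:", np.array_equal(y_test_a, y_test_b))
print(len(y_test_a), "of", len(y_demo), "samples held out")
```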

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error

def train(model, X, y):
    # split the data and train the model on the training portion
    x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model.fit(x_train, y_train)

    # predict on the held-out test set
    pred = model.predict(x_test)

    # perform 5-fold cross-validation on the full dataset
    cv_score = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=5)
    cv_score = np.abs(np.mean(cv_score))

    print("Model Report")
    print("MSE:", mean_squared_error(y_test, pred))
    print('CV Score:', cv_score)

X contains the input attributes and y contains the output attribute.
We use 'cross_val_score' for more reliable validation of the model.
Here, cv=5 means that cross-validation splits the data into 5 folds; the model is trained on 4 folds and scored on the remaining one, 5 times.
The 'neg_mean_squared_error' scorer returns negated MSE values, so np.mean averages the 5 scores and np.abs converts the result to a positive number.
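
To see what cross_val_score actually returns, here is a minimal sketch on synthetic data generated with make_regression, which stands in for X and y. It prints the five per-fold values and also the root of the mean MSE, which is easier to interpret because it is in the units of the target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for X and y (an assumption for illustration).
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=42)

scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         scoring="neg_mean_squared_error", cv=5)

# scikit-learn scorers follow "higher is better", so the MSE is reported negated.
mse_per_fold = -scores
print("MSE per fold:", np.round(mse_per_fold, 2))
print("Mean CV MSE: ", mse_per_fold.mean())

# The square root puts the error back in the units of the target.
print("Mean CV RMSE:", np.sqrt(mse_per_fold.mean()))
```

For the Boston target, an RMSE computed this way would be in thousands of dollars, since MEDV is expressed in units of $1000.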

Linear Regression:

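from sklearn.linear_model import LinearRegression
# the 'normalize' argument was removed in scikit-learn 1.2, so the default constructor is used
model = LinearRegression()
train(model, X, y)
coef = pd.Series(model.coef_, X.columns).sort_values()
coef.plot(kind='bar', title='Model Coefficients')

Mean Squared Error is around 23 and Cross-Validation Score is around 35.

Older versions of scikit-learn accepted LinearRegression(normalize=True), but that argument was deprecated in version 1.0 and removed in version 1.2. Where scaling is needed, it is done as a separate step such as StandardScaler, as we did for the skewed columns above.
rm shows a high positive coefficient and nox shows a high negative coefficient.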

Decision Tree:

from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
train(model, X, y)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')

Here the CV score is higher than for Linear Regression. Because the score is an error (MSE), higher means worse.

Random Forest:

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
train(model, X, y)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')

MSE is around 10 and CV score is around 21.

We know that 'rm' and 'lstat' are strongly correlated with the target variable. This is consistent with their higher feature importance compared with the other attributes.

Extra Trees:

from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
train(model, X, y)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')

Here the MSE is similar to Random Forest, and the cross-validation score is lower.

XGBoost:

import xgboost as xgb
model = xgb.XGBRegressor()
train(model, X, y)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')

Both the MSE and the CV score are the lowest among all the models.

XGBoost shows the best result, with the lowest cross-validation error.

Final Thoughts

To summarize, the XGBoost regressor works best for this project.
You can further improve the model by creating new attributes and performing hyperparameter tuning.
You can also create a new categorical attribute from an existing numerical attribute.
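
Hyperparameter tuning could be sketched with GridSearchCV, as below. This is only an illustration under stated assumptions: the data is synthetic, the model is a RandomForestRegressor rather than XGBoost (to keep the example dependency-free), and the parameter grid is a deliberately tiny, hypothetical one.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X and y (an assumption for illustration).
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=42)

# A deliberately tiny, hypothetical grid; real searches usually cover more values.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV MSE:", -search.best_score_)
```

The search uses the same negated-MSE scoring and 5-fold cross-validation as the train function, so its best score is directly comparable to the CV scores reported above.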

In this article, we discussed the dataset for Boston House Price Prediction. We understood the difference between over-fitting and under-fitting and examined the methods to achieve a good-fit model.

Get the project notebook from here

Thanks for reading the article!!!

Check out more project videos from the YouTube channel Hackers Realm
