Boston House Price Prediction Analysis using Python | Regression | Machine Learning Project Tutorial
Boston House Price Prediction is a regression problem where we predict the price of a house from a set of input variables. Estimating the monetary value of a residence is a classic machine learning exercise, and this problem also leads naturally into the important topic of overfitting and underfitting.

In this project tutorial, we will learn about Boston house price prediction analysis with the help of machine learning. The objective of this problem is to predict the monetary value of a house located in the Boston suburbs.
You can watch the video-based tutorial with a step-by-step explanation down below.
Dataset Information
The Boston House Prices dataset was collected in 1978 and has 506 entries with 14 attributes (features) for homes from various suburbs of Boston.
Attribute Information:
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centers
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: median value of owner-occupied homes in $1000's
MEDV is the target we have to predict; its values are expressed in units of $1000, so a value of 1 corresponds to $1000.
We will predict the target variable from the given 13 input attributes.
We can ignore some attributes if they are not beneficial for predicting the output variable.
We can also create some new features from the available attributes.
Download the Dataset here
Import Modules
First, let us import all the basic modules we will be needing for this project.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')
pandas - used to perform data manipulation and analysis
numpy - used to perform a wide variety of mathematical operations on arrays
matplotlib - used for data visualization and graphical plotting
seaborn - built on top of matplotlib with similar functionalities
%matplotlib inline - to enable inline plotting in the notebook
warnings - to manipulate warnings details
filterwarnings('ignore') is to ignore the warnings thrown by the modules (gives clean results)
Loading the Dataset
df = pd.read_csv("Boston Dataset.csv")
df.drop(columns=['Unnamed: 0'], inplace=True)
df.head()

We have dropped the unnecessary 'Unnamed: 0' column, which is just a leftover row index from the CSV file.
Statistical Information
# statistical info
df.describe()

Every column has a count of 506, so there are no NULL values.
The ranges of the values in all columns look reasonable.
Datatype Information
# datatype info
df.info()

All the columns have a numerical datatype.
We will create new categorical columns from the existing columns later; a small illustrative sketch follows.
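For instance, here is a minimal sketch (not part of the original notebook) of how a categorical column could be derived from an existing numerical one, using illustrative bin edges for the average number of rooms; the lowercase column name 'rm' matches the headers in this CSV.
# bin the average number of rooms into three illustrative categories;
# kept as a separate Series here so the 2x7 plotting grids below still see 14 columns
rm_band = pd.cut(df['rm'], bins=[0, 5, 7, 10], labels=['small', 'medium', 'large'])
rm_band.value_counts()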
Preprocessing the dataset
# check for null values
df.isnull().sum()

No NULL values were found.
Exploratory Data Analysis
Let us create box plots for all columns to identify the outliers.
# create box plots
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()
for col, value in df.items():
    sns.boxplot(y=col, data=df, ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)
We use a for loop to draw one box plot per column in the subplot grid.

In the plots, the dots beyond the whiskers represent outliers.
A column containing many outliers does not follow a normal distribution.
We can reduce the impact of outliers with a log transformation.
We can also drop the columns that contain outliers, or delete the rows that contain them; a small sketch of both options is shown below.
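For illustration (not part of the original notebook), here is one way a log transformation or an IQR-based row filter could be applied to a heavily skewed column such as 'crim'. Neither result is assigned back to df, so the rest of the tutorial is unaffected.
# log-transform a right-skewed column (log1p handles zero values safely)
crim_log = np.log1p(df['crim'])

# alternatively, keep only the rows within 1.5 * IQR of the column's quartiles
q1, q3 = df['crim'].quantile([0.25, 0.75])
iqr = q3 - q1
df_filtered = df[df['crim'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]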
Let us create distribution plots for all columns.
# create dist plot
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()
for col, value in df.items():
    # note: distplot is deprecated in newer seaborn; sns.histplot(value, kde=True, ax=ax[index]) is the modern equivalent
    sns.distplot(value, ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

We can observe right-skewed and left-skewed distributions for 'crim', 'zn', 'tax', and 'black'.
Therefore, we need to normalize these columns.
Min-Max Normalization
We will create a list of these 4 columns and apply min-max normalization to them.
cols = ['crim', 'zn', 'tax', 'black']
for col in cols:
    # find the minimum and maximum of the column
    minimum = min(df[col])
    maximum = max(df[col])
    df[col] = (df[col] - minimum) / (maximum - minimum)
The last line of the loop applies the min-max normalization formula, x_scaled = (x - min) / (max - min).
The loop executes this for each of the 4 selected columns.
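Equivalently, scikit-learn provides a MinMaxScaler that performs the same transformation. A minimal sketch (not part of the original notebook), assuming df and cols are defined as above:
from sklearn.preprocessing import MinMaxScaler

# scales each selected column so its minimum becomes 0 and its maximum becomes 1
mm_scaler = MinMaxScaler()
df[cols] = mm_scaler.fit_transform(df[cols])
Either way, re-plotting the distributions confirms the new ranges.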
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()
for col, value in df.items():
    sns.distplot(value, ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

Now the range of these columns is between 0 and 1.
Min-max normalization maps the maximum value of each column to 1 and the minimum value to 0.
Standardization For Attributes
Standardization rescales a column using its mean and standard deviation: each value x is transformed to (x - mean) / std, giving the column zero mean and unit variance. Here, preprocessing.StandardScaler() is the standardization function from scikit-learn.
# standardization
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
# fit the scaler on the selected columns and transform them
scaled_cols = scaler.fit_transform(df[cols])
scaled_cols = pd.DataFrame(scaled_cols, columns=cols)
scaled_cols.head()

The output above shows the standardized values.
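As a quick optional check (not part of the original notebook), we can confirm that each standardized column now has a mean close to 0 and a standard deviation close to 1:
# means should be ~0 and standard deviations ~1 after standardization
scaled_cols.describe().round(2)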
Let us now copy these values back into the original data frame.
for col in cols:
    df[col] = scaled_cols[col]
This code assigns the standardized values back to the original data frame.
Let us display the updated distributions in subplots.
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()
for col, value in df.items():
    sns.distplot(value, ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

Even now, the columns 'crim', 'zn', 'tax', and 'black' do not show a perfect normal distribution.
However, standardizing these columns should still slightly improve model performance.
Over-fitting vs Under-fitting
We will now discuss the crucial differences between over-fitting and under-fitting with the help of three examples. Each graph contains two classes, 'X' and 'O'.

For Under-Fitting: a straight line represents the under-fitted model. It implies that the model is not trained well or is too simple for the data, so there are many misclassifications between X and O.
For Appropriate-Fitting: a non-linear curve represents a well-fitted model. The model is trained properly, and there are only a few misclassifications.
For Over-Fitting: a highly complex curve classifies every training point correctly. It indicates that the model is over-trained and uses too many features.
The appropriately fitted model is a generalized model that performs well on both training and testing data.
The below graph contains examples for bias and variance.

High bias (Under-fit): the model uses few features, so regression produces a simple straight line.
Balanced bias and variance (Good fit): the model uses sufficient features, so it produces a non-linear curve that captures the underlying pattern accurately.
High variance (Over-fit): the model uses a large number of features, so it captures all the information, including noise, and produces an overly complex curve.
In a nutshell, an over-fitted model shows good performance on the training data but poor generalization to test data, whereas an under-fitted model shows poor performance on the training data and poor generalization to test data.
Depending on the number of features and overall model complexity, a model can end up under-fitted or over-fitted.
Always aim for a good-fit model; a small sketch illustrating this trade-off on the Boston data follows.
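To make this concrete, here is a small sketch (not from the original tutorial) that compares training and test scores for models of increasing complexity on this dataset. It assumes the target column is named 'medv', matching the lowercase CSV headers used above. Typically the higher-degree model scores much better on the training data than on the test data, which is the signature of over-fitting, while the degree-1 model is the simplest and may slightly under-fit.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

# split features and target (assumes the target column is 'medv')
X = df.drop(columns=['medv'])
y = df['medv']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# increase model complexity by raising the polynomial degree
for degree in [1, 2, 3]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree}  train R^2={model.score(X_train, y_train):.2f}  "
          f"test R^2={model.score(X_test, y_test):.2f}")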