- Hackers Realm

# Boston House Price Prediction Analysis using Python | Regression | Machine Learning Project Tutorial

Updated: Apr 9

**Boston House Price Prediction** is a regression problem in which we predict the price of a house from a set of independent variables (features). Predicting the monetary value of a residence is a common application of machine learning. This regression problem also leads to the important topics of **overfitting** and **underfitting**.

In this project tutorial, we learn about Boston house price prediction analysis with the help of machine learning. The objective is to predict the monetary value of a house located in the Boston suburbs.


**Dataset Information**

The Boston House Prices dataset was collected in 1978 and has **506 entries** with **14 attributes (features)** for homes from various suburbs of Boston.

##### Attribute Information:

- **CRIM** per capita crime rate by town

- **ZN** proportion of residential land zoned for lots over 25,000 sq. ft.

- **INDUS** proportion of non-retail business acres per town

- **CHAS** Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

- **NOX** nitric oxides concentration (parts per 10 million)

- **RM** average number of rooms per dwelling

- **AGE** proportion of owner-occupied units built prior to 1940

- **DIS** weighted distances to five Boston employment centers

- **RAD** index of accessibility to radial highways

- **TAX** full-value property-tax rate per $10,000

- **PTRATIO** pupil-teacher ratio by town

- **B** 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

- **LSTAT** % lower status of the population

- **MEDV** Median value of owner-occupied homes in $1000's

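**MEDV** is the price we have to predict. It is expressed in thousands of dollars, so a value of 1 corresponds to $1,000. We will predict this target variable from the 13 input attributes. The code in this tutorial refers to the columns by lowercase names (for example `crim` and `medv`), and **B** appears as `black`.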

We can ignore some attributes if they are not useful for predicting the output variable.

We can also create some new features from the available attributes.

*Download the dataset* __here__

**Import Modules**

First, let us import all the basic modules we will be needing for this project.

```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')
```

pandas - used to perform data manipulation and analysis

numpy - used to perform a wide variety of mathematical operations on arrays

matplotlib - used for data visualization and graphical plotting

seaborn - built on top of matplotlib with similar functionalities

%matplotlib inline - enables inline plotting in the notebook

warnings - used to control how warnings are displayed

filterwarnings('ignore') suppresses the warnings raised by the modules, which keeps the output clean

**Loading the Dataset**

```
# load the dataset from the CSV file
df = pd.read_csv("Boston Dataset.csv")
# drop the unnamed column (typically a saved row index)
df = df.drop(columns=['Unnamed: 0'])
df.head()
```

We have dropped the unnecessary column 'Unnamed: 0'.
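An 'Unnamed: 0' column usually appears when a data frame was saved together with its row index. As an alternative sketch, the column can be read as the index in the first place with `index_col=0`. The two-row CSV text below is a made-up stand-in, not the actual file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV that was written together with its row index,
# which is how an 'Unnamed: 0' column usually ends up in a file.
csv_text = ",crim,zn,medv\n0,0.00632,18.0,24.0\n1,0.02731,0.0,21.6\n"

# Default read: the header-less first column is named 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 uses that column as the index instead,
# so there is nothing to drop afterwards.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

Either approach leaves the same feature columns; dropping the column afterwards, as done above, simply makes the step explicit.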

**Statistical Information.**

```
# statistical info
df.describe()
```

There are no NULL values: the count is the same for every column.

The remaining statistics show no obviously invalid values.

**Datatype Information.**

```
# datatype info
df.info()
```

All the columns are in numerical datatype.

We will create new categorical columns using the existing columns later.

**Preprocessing the dataset**

```
# check for null values
df.isnull().sum()
```

No NULL values were found.

**Exploratory Data Analysis**

**Let us create box plots for all columns to identify the outliers.**

```
# create box plots
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()
for col, value in df.items():
    sns.boxplot(y=col, data=df, ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)
```

We use a **for loop** to create the subplots.

In the graph, the dots represent the **outliers**. A column with many outliers usually does not follow a normal distribution.

We can reduce the effect of outliers with a log transformation.

We can also drop a column that contains many outliers, or delete the rows that contain them.
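To make the outlier idea concrete, here is a minimal sketch of the 1.5 x IQR rule that box plots use by default to decide which points are drawn as dots, and of how a log transformation can reduce the number of such points. The data is a synthetic right-skewed sample, not the Boston data, and `count_iqr_outliers` is a helper name made up for this illustration.

```python
import numpy as np

def count_iqr_outliers(values, whis=1.5):
    """Count points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(np.sum((values < lower) | (values > upper)))

# Hypothetical right-skewed sample (assumption: stands in for a skewed column).
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)

before = count_iqr_outliers(skewed)
# log1p(x) = log(1 + x) is used so that zero values would also be handled.
after = count_iqr_outliers(np.log1p(skewed))

print("outliers before log1p:", before)
print("outliers after log1p: ", after)
```

The transformation compresses large values, so fewer points fall beyond the whiskers. It does not remove any rows.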

**Let us create distribution plots for all columns.**

```
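# create dist plot
# note: sns.distplot is deprecated since seaborn 0.11;
# sns.histplot(value, kde=True, ax=ax[index]) is the suggested replacement
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()
for col, value in df.items():
    sns.distplot(value, ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)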
```

We can observe right-skewed and left-skewed distributions for '**crim**', '**zn**', '**tax**', and '**black**'. Therefore, we need to normalize these columns.
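Skewness can also be read as a number instead of judging it from the plots. The sketch below uses pandas' `skew()` on synthetic columns (not the Boston data): positive values indicate a right-skewed column, negative values a left-skewed one, and values near zero a roughly symmetric one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns, used only to show the sign convention of skew().
demo = pd.DataFrame({
    "right_skewed": rng.exponential(scale=1.0, size=1000),
    "left_skewed": -rng.exponential(scale=1.0, size=1000),
    "symmetric": rng.normal(size=1000),
})

# One skewness value per column.
skewness = demo.skew()
print(skewness.round(2))
```

Running `df.skew()` on the real data frame would rank the columns by how far they are from symmetric.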

**Min-Max Normalization**

**We will create the column list for the 4 columns and use Min-Max Normalization.**

```
cols = ['crim', 'zn', 'tax', 'black']
for col in cols:
    # find minimum and maximum of that column
    minimum = df[col].min()
    maximum = df[col].max()
    # rescale the column to the range [0, 1]
    df[col] = (df[col] - minimum) / (maximum - minimum)
```

The last line shows the formula for min-max normalization.

It will execute this code for the selected 4 columns.

```
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()
for col, value in df.items():
    sns.distplot(value, ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)
```

Now the values of these columns lie between 0 and 1.

Min-Max Normalization maps the maximum value to 1 and the minimum value to 0.
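As a small worked example of the same formula, the sketch below applies min-max normalization to a handful of made-up numbers. The helper name `min_max_scale` and the guard for a constant column are additions for this illustration.

```python
import numpy as np

def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant column has no spread; return zeros instead of dividing by zero.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

# Hypothetical values, not taken from the dataset.
tax = np.array([187.0, 296.0, 330.0, 711.0])
scaled = min_max_scale(tax)
print(scaled)
```

Because the mapping is linear, the order of the values and the shape of their distribution stay the same. Only the range changes.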

**Standardization For Attributes**

Standardization uses the mean and standard deviation. Here, **preprocessing.StandardScaler()** is the standardization function.

```
# standardization
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
# fit the scaler on the selected columns and transform them
scaled_cols = scaler.fit_transform(df[cols])
scaled_cols = pd.DataFrame(scaled_cols, columns=cols)
scaled_cols.head()
```

The output above shows the standardized values.

**Let us get back to our original data frame.**

```
for col in cols:
    df[col] = scaled_cols[col]
```

This code assigns the standardized values back to the original data frame.

**Let us display the standardized values in subplots.**

```
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()
for col, value in df.items():
    sns.distplot(value, ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)
```

Even now the columns '**crim**', '**zn**', '**tax**', and '**black**' do not show a perfect normal distribution. Standardization only shifts and rescales the values, so it does not change the shape of a distribution. However, putting these columns on a common scale can slightly improve the model performance.
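The following sketch illustrates what standardization computes, namely z = (x - mean) / std with the population standard deviation, which is also what `StandardScaler` uses. It also shows that standardizing a column that was already min-max normalized gives the same z-scores as standardizing the raw column, because both steps are linear rescalings. The data is synthetic and the helper `standardize` is a made-up name.

```python
import numpy as np

def standardize(values):
    """z-score using the population standard deviation (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical skewed column (assumption: stands in for something like 'crim').
rng = np.random.default_rng(0)
raw = rng.exponential(scale=3.0, size=200)
min_max = (raw - raw.min()) / (raw.max() - raw.min())

z_raw = standardize(raw)
z_min_max = standardize(min_max)

print("mean:", round(float(z_raw.mean()), 6), "std:", round(float(z_raw.std()), 6))
print("same z-scores after min-max:", np.allclose(z_raw, z_min_max))
```

The result has mean 0 and standard deviation 1, but it is exactly as skewed as the input. Reducing skew needs a non-linear step such as a log transformation.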

**Over-fitting vs Under-fitting**

We will now discuss crucial differences between Over-fitting and Under-fitting with the help of three examples. Each graph contains two classes 'X' and 'O'.

**For Under-Fitting:** A straight line represents the under-fitted model. The model is not well trained, for example because it is too simple or the training data is limited. There are many misclassifications between X and O.

**For Appropriate-Fitting:** A smooth non-linear curve represents the well-fitted model. The model is trained well and there are only a few misclassifications.

**For Over-Fitting:** A very complex curve separates every training point correctly. This indicates that the model is overtrained, typically because it has too many features, and has fitted the noise in the training data.

**The Appropriate-Fitting is a Generalized Model which is good for training and testing.**

**The graph below contains examples of bias and variance.**

**High-bias (Under-Fit):** It contains few features. Hence regression produces a simple straight line.

**Low-bias, low-variance (Good-Fit):** It contains sufficient features. Hence it produces a non-linear curve that follows the underlying pattern.

**High-variance (Over-Fit):** It contains a high number of features. Hence it captures every detail of the training data, including noise, and produces an overly complex curve.

**In a nutshell, over-fitting shows good performance on the training data but poor generalization to test data, whereas under-fitting shows poor performance on both the training data and the test data.**

Depending on the number of features, the model can be under-fitted or over-fitted.

Always aim for a good-fit model.
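The same effect can be reproduced numerically. The sketch below is a regression analogue of the pictures described above, using synthetic data rather than the Boston data. It fits polynomials of increasing degree to noisy samples of a sine curve and compares the error on the training points with the error on fresh test points. The degrees 1, 3 and 10 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy samples of a smooth non-linear function (synthetic).
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 10):
    # More polynomial terms play the role of "more features".
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree:2d}  train MSE {results[degree][0]:.4f}  "
          f"test MSE {results[degree][1]:.4f}")
```

The training error can only go down as the degree grows. What separates a good fit from an over-fit is the gap between the training error and the test error.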

**Correlation Matrix**

```
corr = df.corr()
plt.figure(figsize=(20,10))
sns.heatmap(corr, annot=True, cmap='coolwarm')
```

We mostly focus on correlations with the target variable, as this is a regression problem.

But we can also observe that the input attributes **'tax'** and **'rad'** are highly correlated with each other. We will later remove this redundancy by dropping one of the two variables.

Additionally, we will plot **'lstat'** and **'rm'** to show their correlation with the target variable **'medv'**.
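Highly correlated input pairs can also be listed programmatically instead of being read off the heatmap. The sketch below runs on a small synthetic data frame. The column names, the 0.9 threshold, and the helper `correlated_pairs` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
a = rng.normal(size=n)

# Synthetic frame: 'b' is nearly a rescaled copy of 'a', 'c' is unrelated.
demo = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=n),
    "c": rng.normal(size=n),
    "target": 3.0 * a + rng.normal(scale=0.5, size=n),
})

def correlated_pairs(frame, target, threshold=0.9):
    """Return input column pairs whose absolute correlation exceeds threshold."""
    corr = frame.drop(columns=[target]).corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

pairs = correlated_pairs(demo, target="target")
print(pairs)
```

For each reported pair, one of the two columns can be dropped from the inputs, which is how 'rad' is treated in this tutorial.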

**Relation between Target and Correlated Variables**

`sns.regplot(y=df['medv'], x=df['lstat'])`

Here, the price of houses decreases as **'lstat'** increases. Hence it is negatively correlated.

`sns.regplot(y=df['medv'], x=df['rm'])`

Here, the price of houses increases as **'rm'** increases. Hence it is positively correlated.

**Input Split**

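**Let us separate the input attributes from the target variable.**

```
# inputs: every column except the target 'medv' and 'rad' (highly correlated with 'tax')
X = df.drop(columns=['medv', 'rad'])
# target
y = df['medv']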
```

**Model Training**

**Now let's import functions to train models.**

Instead of training on the whole dataset, we will split it so that we can estimate the model performance on data the model has not seen.

If you train and test on the same data, the results will be overly optimistic. Hence, we will use **'train_test_split'**.

We set **random_state** to 42 to get the same split upon re-running. If you don't specify a random state, the data is split differently on each run, giving inconsistent results.
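A quick sketch of the effect of `random_state`, using a small made-up array instead of the Boston data: calling `train_test_split` twice with the same seed returns identical splits. With no `test_size` given, scikit-learn holds out 25% of the rows for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 2 features.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

first = train_test_split(X_demo, y_demo, random_state=42)
second = train_test_split(X_demo, y_demo, random_state=42)

# Compare the four returned arrays pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))

x_train, x_test, y_train, y_test = first
print("train rows:", len(y_train), "test rows:", len(y_test))
print("identical splits:", same_split)
```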

```
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error

def train(model, X, y):
    # split the data and train the model on the training part
    x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model.fit(x_train, y_train)
    # predict on the held-out test set
    pred = model.predict(x_test)
    # perform 5-fold cross-validation on the full dataset
    cv_score = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=5)
    cv_score = np.abs(np.mean(cv_score))
    print("Model Report")
    print("MSE:", mean_squared_error(y_test, pred))
    print('CV Score:', cv_score)
```

**X** contains the input attributes and **y** contains the output attribute.

We use **'cross_val_score'** for a more reliable validation of the model.

Here, **cv=5** means that cross-validation splits the data into 5 parts, training on 4 of them and scoring on the remaining one in turn. The scores are negative mean squared errors, so **np.mean** gives the average of the 5 scores and **np.abs** converts that negative average to a positive value.
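To see the sign convention in isolation, the sketch below runs `cross_val_score` with `scoring='neg_mean_squared_error'` on synthetic data (not the Boston data). Each of the 5 fold scores is zero or negative, and taking the absolute value of their mean gives the average MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data with a little noise (assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# One score per fold; scikit-learn negates the MSE so that "higher is better".
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         scoring="neg_mean_squared_error", cv=5)
cv_mse = np.abs(np.mean(scores))

print("fold scores:", scores.round(4))
print("CV MSE:", round(float(cv_mse), 4))
```

Since the reported CV score is an error, a lower value indicates a better model.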

**Linear Regression:**

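```
from sklearn.linear_model import LinearRegression
# 'normalize=True' was accepted by older scikit-learn releases; it was
# deprecated in version 1.0 and removed in 1.2, so it is omitted here
model = LinearRegression()
train(model, X, y)
coef = pd.Series(model.coef_, X.columns).sort_values()
coef.plot(kind='bar', title='Model Coefficients')
```

Mean Squared Error is around 23 and Cross-Validation Score is around 35.

Normalization matters for basic models like linear regression. Older scikit-learn versions accepted **normalize=True** for this purpose, but the argument was removed in version 1.2, so scaling has to be done before the model, as we did for the skewed columns above.

**rm** shows a high positive coefficient and **nox** shows a high negative coefficient.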
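On current scikit-learn versions, one way to combine feature scaling with linear regression is a pipeline that applies `StandardScaler` before the model. The sketch below uses synthetic data with two features on very different scales. It is one possible setup, not the original author's code. It also shows that, for ordinary least squares, scaling leaves the predictions unchanged and only alters the size of the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the second feature is on a scale 100 times larger.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 2)) * np.array([1.0, 100.0])
y_demo = 2.0 * X_demo[:, 0] + 0.03 * X_demo[:, 1] + rng.normal(scale=0.1, size=80)

plain = LinearRegression().fit(X_demo, y_demo)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)

# Ordinary least squares predictions do not depend on the feature scale.
same_predictions = np.allclose(plain.predict(X_demo), scaled.predict(X_demo))

# Coefficients of the scaled model are per standard deviation of each feature.
scaled_coef = scaled.named_steps["linearregression"].coef_

print("same predictions:", same_predictions)
print("plain coefficients: ", plain.coef_.round(3))
print("scaled coefficients:", scaled_coef.round(3))
```

Coefficients from the scaled model are easier to compare across features, which is useful when plotting them as a bar chart.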

**Decision Tree:**

```
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
train(model, X, y)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')
```

Here the CV score is higher than for Linear Regression. Since the score is an error, higher means worse.

**Random Forest:**

```
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
train(model, X, y)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')
```

MSE is around 10 and CV score is around 21.

We know that **'rm'** and **'lstat'** are strongly correlated with the target variable. That is the reason for their higher feature importance compared with the other attributes.

**Extra Trees:**

```
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
train(model, X, y)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')
```

Here the MSE is similar to that of Random Forest and the cross-validation score is lower.

**XGBoost:**

```
import xgboost as xgb
model = xgb.XGBRegressor()
train(model, X, y)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')
```

Both the MSE and the CV score are the lowest among all the models.

XGBoost shows the best result, with the smallest cross-validation error.

**Final Thoughts**

To summarize, XGBoost Regressor works best for this project.

You can further improve the model by creating new attributes and performing hyperparameter tuning.

You can create a new categorical attribute from an existing numerical attribute.
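As one possible sketch of this idea, the example below bins a numerical column into labelled bands with `pd.cut` and then one-hot encodes the result. The values, bin edges, and labels are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical room counts, not taken from the dataset.
demo = pd.DataFrame({"rm": [4.2, 5.9, 6.3, 7.1, 8.4]})

# Assumed bin edges: (0, 5], (5, 7], (7, 10]
bins = [0, 5, 7, 10]
labels = ["small", "medium", "large"]
demo["rm_band"] = pd.cut(demo["rm"], bins=bins, labels=labels)

# Tree models can use the band directly once it is one-hot encoded.
encoded = pd.get_dummies(demo, columns=["rm_band"])

print(demo)
print(list(encoded.columns))
```

Whether such a feature helps has to be checked with the same MSE and cross-validation scores used above.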

In this article, we discussed the dataset for Boston House Price Prediction. We understood the difference between over-fitting and under-fitting and examined methods to achieve a good-fit model.

**Get the project notebook from** __here__

*Thanks for reading the article!!!*

*Check out more project videos from the YouTube channel *__Hackers Realm__