Bike Sharing Demand Analysis is a regression problem in which we predict the demand for bicycles at a particular time of day using Python. This article focuses on predicting bike rentals and returns in different areas of a city during a future period based on historical data, weather data, and time data. Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city.
In this project tutorial, we will analyze and process the dataset to predict the bike rental demand for a specific time period and given weather conditions, based on the collected data.
Dataset Information
Bike-sharing systems are a new generation of traditional bike rentals where the whole process of membership, rental, and return has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues.
Apart from the interesting real-world applications of bike-sharing systems, the characteristics of the data generated by these systems make them attractive for research. Unlike other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns the bike-sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that the most important events in the city could be detected by monitoring these data.
Attribute Information:
Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv
instant: record index
dteday : date
season : season (1:winter, 2:spring, 3:summer, 4:fall)
yr : year (0: 2011, 1:2012)
mnth : month ( 1 to 12)
hr : hour (0 to 23)
holiday : whether the day is a holiday or not (extracted from [Web Link])
weekday : day of the week
workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
weathersit :
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
hum: Normalized humidity. The values are divided by 100 (max)
windspeed: Normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes including both casual and registered
Here, the output variable is "cnt".
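The normalization formulas in the attribute list can be inverted to recover temperatures in degrees Celsius. The snippet below is a minimal sketch of that inversion; the helper name denormalize is hypothetical, and the ranges are the t_min and t_max values quoted in the list above.

```python
def denormalize(value, t_min, t_max):
    """Invert (t - t_min) / (t_max - t_min) to recover degrees Celsius."""
    return value * (t_max - t_min) + t_min

# Ranges taken from the attribute list: temp uses [-8, 39], atemp uses [-16, 50].
temp_c = denormalize(0.5, t_min=-8, t_max=39)     # a normalized 'temp' of 0.5
atemp_c = denormalize(0.5, t_min=-16, t_max=50)   # a normalized 'atemp' of 0.5
print(temp_c, atemp_c)
```

The same idea applies to hum and windspeed, which only need to be multiplied by 100 and 67 respectively.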
Import Modules
Let us import all the basic modules we will be needing for this project.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')
pd.options.display.max_columns = 999
pandas - used to perform data manipulation and analysis
numpy - used to perform a wide variety of mathematical operations on arrays
matplotlib - used for data visualization and graphical plotting
seaborn - built on top of matplotlib with similar functionalities
%matplotlib inline - to display the plots inline in the notebook
warnings - to control how warnings are handled
filterwarnings('ignore') suppresses the warnings raised by the modules, which keeps the output clean
max_columns = 999 raises the display limit so that all the features are shown
Loading the Dataset
df = pd.read_csv('hour.csv')
df.head()
Later we will drop the unnecessary columns "casual" and "registered".
During feature engineering, we will need to one-hot encode the categorical columns.
# statistical info
df.describe()
There are no missing values in the dataset.
# datatype info
df.info()
We will also drop the unnecessary columns 'instant' and 'dteday'.
The remaining columns are of float and integer type.
# unique values
df.apply(lambda x: len(x.unique()))
Attributes containing many unique values are of numerical type. The remaining attributes are of categorical type.
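The rule of thumb above can be sketched as a small helper that splits columns by their number of unique values. This is only an illustration: the function name split_by_cardinality, the threshold of 24 (chosen because the hour column has 24 values), and the toy data frame are all assumptions, not part of the original notebook.

```python
import pandas as pd

def split_by_cardinality(frame, max_categories=24):
    """Treat columns with few unique values as categorical, the rest as numerical."""
    counts = frame.nunique()
    categorical = counts[counts <= max_categories].index.tolist()
    numerical = counts[counts > max_categories].index.tolist()
    return categorical, numerical

# Toy frame with made-up values, standing in for hour.csv.
toy = pd.DataFrame({
    'season': [1, 2, 3, 4] * 10,                  # 4 unique values
    'hr': list(range(24)) + list(range(16)),      # 24 unique values
    'temp': [i / 40 for i in range(40)],          # 40 unique values
})
categorical, numerical = split_by_cardinality(toy)
print(categorical, numerical)
```

A fixed threshold is a heuristic; it is worth checking the result against the attribute descriptions before relying on it.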
Preprocessing the dataset
Data preprocessing refers to preparing (cleaning and organizing) the raw data to make it suitable for building and training Machine Learning models.
Let us check for NULL values in the dataset.
# check for null values
df.isnull().sum()
There are no NULL values in the dataset.
We will rename some of the columns, for example to year, month and hour, for a better understanding.
df = df.rename(columns={'weathersit':'weather',
'yr':'year',
'mnth':'month',
'hr':'hour',
'hum':'humidity',
'cnt':'count'})
df.head()
The attributes now contain meaningful titles.
Let us drop unnecessary columns.
df = df.drop(columns=['instant', 'dteday', 'year'])
For better visualization, let us convert the integer columns into categorical columns.
# change int columns to category
cols = ['season','month','hour','holiday','weekday','workingday','weather']
for col in cols:
    df[col] = df[col].astype('category')
df.info()
The selected columns are converted into categorical columns.
Later we will use the remaining numerical columns to find the correlation.
Exploratory Data Analysis
We will analyze the data using visual techniques in terms of time and other attributes.
Let us start with the Time.
fig, ax = plt.subplots(figsize=(20,10))
sns.pointplot(data=df, x='hour', y='count', hue='weekday', ax=ax)
ax.set(title='Count of bikes during weekdays and weekends')
The X-axis is the hour and Y-axis is the count of the bike.
On weekdays, we observe a peak in the morning hours and in the evening.
On weekends, the peak value is in the afternoon.
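The peaks read off the plot can also be checked numerically. The sketch below groups by day type and hour and reports the hour with the highest mean count; the helper name peak_hour_by_group is hypothetical, it groups by 'workingday' rather than 'weekday' for brevity, and the toy values are made up purely to show the mechanics.

```python
import pandas as pd

def peak_hour_by_group(frame, group_col, hour_col='hour', target_col='count'):
    """Return, for each group, the hour with the highest mean target value."""
    hourly_mean = (
        frame.groupby([group_col, hour_col], observed=True)[target_col]
        .mean()
        .unstack(hour_col)      # rows: groups, columns: hours
    )
    return hourly_mean.idxmax(axis=1)

# Made-up toy data; with the real frame you would pass df instead.
toy = pd.DataFrame({
    'workingday': [1, 1, 1, 1, 0, 0, 0, 0],
    'hour':       [8, 8, 13, 17, 8, 13, 13, 17],
    'count':      [300, 320, 120, 400, 60, 250, 270, 180],
})
peaks = peak_hour_by_group(toy, 'workingday')
print(peaks)
```

Note that this returns only the single highest hour per group, so a second peak (such as a morning rush) would need a separate look at the grouped means.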
Let us plot the same attributes for casual users.
fig, ax = plt.subplots(figsize=(20,10))
sns.pointplot(data=df, x='hour', y='casual', hue='weekday', ax=ax)
ax.set(title='Count of bikes during weekdays and weekends: Unregistered users')
The graph shows the count of unregistered users throughout the week.
We observe a high count on weekends.
This data can be related to weekend outdoor activities.
Let us use the same attributes with registered users.
fig, ax = plt.subplots(figsize=(20,10))
sns.pointplot(data=df, x='hour', y='registered', hue='weekday', ax=ax)
ax.set(title='Count of bikes during weekdays and weekends: Registered users')
The graph shows the count of registered users throughout the week.
This data can be related to the working personnel.
Let us explore the graph in terms of weather.
fig, ax = plt.subplots(figsize=(20,10))
sns.pointplot(data=df, x='hour', y='count', hue='weather', ax=ax)
ax.set(title='Count of bikes during different weathers')
The graph is similar to the previous graphs, except for weather 4.
Weather 4, drawn in red, corresponds to heavy rain or snow, when very few users rent a bike.
Let us explore the graph in terms of season.
fig, ax = plt.subplots(figsize=(20,10))
sns.pointplot(data=df, x='hour', y='count', hue='season', ax=ax)
ax.set(title='Count of bikes during different seasons')
Out of the four seasons, three show a similar graph.
Let us explore the graph in terms of months.
fig, ax = plt.subplots(figsize=(20,10))
sns.barplot(data=df, x='month', y='count', ax=ax)
ax.set(title='Count of bikes during different months')
The number of users increases over the first months of the year and then gradually decreases toward the end of the year.
Let us explore the graph in terms of weekdays.
fig, ax = plt.subplots(figsize=(20,10))
sns.barplot(data=df, x='weekday', y='count', ax=ax)
ax.set(title='Count of bikes during different days')
In this graph, the average number of users is roughly the same on every day of the week.
Thus, the daily average on its own is of little use for predictions.
Regression plot of temperature and humidity with respect to count.
fig, (ax1,ax2) = plt.subplots(ncols=2, figsize=(20,6))
sns.regplot(x=df['temp'], y=df['count'], ax=ax1)
ax1.set(title="Relation between temperature and users")
sns.regplot(x=df['humidity'], y=df['count'], ax=ax2)
ax2.set(title="Relation between humidity and users")
As the temperature increases, the number of users increases.
As the humidity increases, the number of users decreases.
from statsmodels.graphics.gofplots import qqplot
fig, (ax1,ax2) = plt.subplots(ncols=2, figsize=(20,6))
sns.histplot(df['count'], kde=True, stat='density', ax=ax1)  # distplot is deprecated in recent seaborn
ax1.set(title='Distribution of the users')
qqplot(df['count'], ax=ax2, line='s')
ax2.set(title='Theoretical quantiles')
The distribution of the users is strongly skewed, so the data is far from normally distributed.
In the Q-Q plot, many points lie far from the red line, so we should transform the data to follow the red line as closely as possible.
Now we will apply a log transformation to reduce the skew.
df['count'] = np.log(df['count'])
fig, (ax1,ax2) = plt.subplots(ncols=2, figsize=(20,6))
sns.histplot(df['count'], kde=True, stat='density', ax=ax1)  # distplot is deprecated in recent seaborn
ax1.set(title='Distribution of the users')
qqplot(df['count'], ax=ax2, line='s')
ax2.set(title='Theoretical quantiles')
Now the distribution is less skewed, so the transformation worked as intended.
The points in the Q-Q plot now follow the red line much more closely.
You may also try min-max normalization or standardization to see different results.
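To compare the three options side by side, here is a minimal NumPy sketch applied to a handful of made-up count values (the numbers are illustrative only, not taken from the dataset). The log transform requires strictly positive values; np.log1p is a common alternative when zeros are possible.

```python
import numpy as np

counts = np.array([1.0, 5.0, 40.0, 200.0, 977.0])   # made-up toy values

# Log transformation: compresses large values, reducing skew.
log_counts = np.log(counts)

# Min-max normalization: rescales to the range [0, 1].
min_max = (counts - counts.min()) / (counts.max() - counts.min())

# Standardization: zero mean and unit standard deviation.
standardized = (counts - counts.mean()) / counts.std()

print(log_counts)
print(min_max)
print(standardized)
```

Min-max normalization and standardization are linear rescalings, so they change the scale but not the shape of the distribution; only the log transform affects the skew.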
Correlation Matrix
corr = df.select_dtypes(include='number').corr()  # keep only the numerical columns
plt.figure(figsize=(15,10))
sns.heatmap(corr, annot=True, annot_kws={'size':15})
We use the correlation matrix for numerical data.
We observe a highly positive correlation between 'temp' and 'atemp' and between 'casual' and 'registered'.
'Windspeed' displays an insignificant contribution to the count.
Hence, we will drop a few unnecessary columns later.
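Reading strongly correlated pairs off a heatmap can be error-prone, so the sketch below lists them programmatically. The helper name correlated_pairs, the 0.9 threshold, and the toy values are assumptions made for illustration; with the real data you would pass the numerical columns of df.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.9):
    """List column pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Made-up toy values in which 'atemp' closely tracks 'temp'.
toy = pd.DataFrame({
    'temp':      [0.10, 0.20, 0.30, 0.40, 0.50, 0.60],
    'atemp':     [0.12, 0.21, 0.33, 0.41, 0.52, 0.61],
    'windspeed': [0.30, 0.10, 0.40, 0.10, 0.50, 0.20],
})
high = correlated_pairs(toy)
print(high)
```

When two features are this strongly correlated, keeping only one of them usually loses little information.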
One hot Encoding
pd.get_dummies(df['season'], prefix='season', drop_first=True)
This displays the encoded seasons: if a specific season is present in a row, the corresponding column is set to 1 and the other columns are set to 0.
The prefix is prepended to each new column name, which makes the columns easier to understand.
drop_first drops the first column, so if all three remaining columns are 0, that means season 1 is present.
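The effect of drop_first is easiest to see on a tiny example. The sketch below encodes a made-up Series of five season values; only the toy data is an assumption, the call itself is the same pd.get_dummies call used above.

```python
import pandas as pd

# Made-up toy values covering all four seasons.
seasons = pd.Series([1, 2, 3, 4, 1], name='season')
dummies = pd.get_dummies(seasons, prefix='season', drop_first=True)
print(dummies)
```

The result has the columns season_2, season_3 and season_4; the rows where the season is 1 contain only zeros (or False, depending on the pandas version), which is how season 1 is represented.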
df_oh = df.copy()
def one_hot_encoding(data, column):
    # append the dummy columns, then drop the original categorical column
    data = pd.concat([data, pd.get_dummies(data[column], prefix=column, drop_first=True)], axis=1)
    data = data.drop([column], axis=1)
    return data
cols = ['season','month','hour','holiday','weekday','workingday','weather']
for col in cols:
    df_oh = one_hot_encoding(df_oh, col)
df_oh.head()
This is the new data frame after one-hot encoding, which adds new features.
The additional features will increase the training time and may also improve the accuracy.
Input Split
Now we will drop the columns we don't need for the model training
X = df_oh.drop(columns=['atemp', 'windspeed', 'casual', 'registered', 'count'])
y = df_oh['count']
Model Training
from sklearn.linear_model import LinearRegression, Ridge, HuberRegressor, ElasticNetCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
models = [LinearRegression(),
Ridge(),
HuberRegressor(),
ElasticNetCV(),
DecisionTreeRegressor(),
RandomForestRegressor(),
ExtraTreesRegressor(),
GradientBoostingRegressor()]
from sklearn import model_selection
def train(model):
    # shuffle=True is required for random_state to take effect
    kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)
    pred = model_selection.cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')
    cv_score = pred.mean()
    print('Model:',model)
    print('CV score:', abs(cv_score))
for model in models:
    train(model)
Various models were imported to compare their results.
These are common models for regression problems; you may investigate and use other models for other results.
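The 'neg_mean_squared_error' scorer returns negated errors, so that a higher score is always better; this is why the absolute value is taken when printing. The sketch below shows the sign convention on synthetic data generated with make_regression (an assumption for illustration, not the bike data) and converts the per-fold scores to RMSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X and y.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# random_state only has an effect when shuffle=True.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy,
                          cv=kfold, scoring='neg_mean_squared_error')

mse_per_fold = -neg_mse                 # flip the sign to get the usual MSE
rmse_per_fold = np.sqrt(mse_per_fold)   # same units as the target
print(rmse_per_fold.mean(), rmse_per_fold.std())
```

Looking at the spread across folds, not just the mean, gives a rough idea of how stable each model's score is.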
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
We split the dataset into training and testing sets.
model = RandomForestRegressor()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
Random Forest gave the lowest error, so we train it to inspect the residuals on the test data.
# plot the error difference
error = y_test - y_pred
fig, ax = plt.subplots()
ax.scatter(y_test, error)
ax.axhline(lw=3, color='black')
ax.set_xlabel('Observed')
ax.set_ylabel('Error')
plt.show()
This visualizes the prediction errors on the test set, both positive and negative.
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, y_pred))
This is the root mean squared error (RMSE) between the test data and the predicted data. Since the target was log-transformed, it is measured in log units.
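Because the model was trained on the log of the count, the error above is not expressed in numbers of bikes. One way to get an error in rental units, sketched below under that assumption, is to invert the log with np.exp before computing the metric. The helper name rmse and the toy values are hypothetical.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally sized sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up toy values on the log scale, standing in for y_test and y_pred.
y_true_log = np.log([10.0, 100.0, 250.0])
y_pred_log = np.log([12.0, 90.0, 300.0])

rmse_log = rmse(y_true_log, y_pred_log)                       # error in log units
rmse_rentals = rmse(np.exp(y_true_log), np.exp(y_pred_log))   # error in rentals
print(rmse_log, rmse_rentals)
```

The two numbers answer different questions: the log-scale error reflects relative deviations, while the back-transformed error is dominated by the busiest hours.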
Final Thoughts
Out of the 8 models, Random Forest Regressor is the top performer with the lowest CV score (mean squared error).
You may perform further analysis with the variety of results given by the different models.
You can also use hyperparameter tuning to improve the model performance.
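As one possible starting point for tuning, the sketch below runs a small grid search over a random forest with scikit-learn's GridSearchCV. The parameter grid, the synthetic data from make_regression, and the fold count are all assumptions chosen to keep the example fast; they are not recommended settings for the bike data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for x_train and y_train.
X_toy, y_toy = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)

# A deliberately small, illustrative grid.
param_grid = {'n_estimators': [20, 50], 'max_depth': [3, None]}

search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=3, scoring='neg_mean_squared_error')
search.fit(X_toy, y_toy)

print(search.best_params_)
print(-search.best_score_)   # cross-validated MSE of the best combination
```

On the full dataset a grid search over a random forest can be slow, so a randomized search over fewer combinations may be a more practical first step.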
In this article, we explored the Bike Sharing Demand dataset using various machine learning techniques and plotted graphs. We also compared different models, from basic to advanced, trained on the data.
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm