
# Black Friday Sales Prediction Analysis using Python | Regression | Machine Learning Project Tutorial

Black Friday Sales Prediction is a regression problem where we analyze and predict the sales of a product in a retail store based on various aspects of the dataset. The objective is to build a predictive model and estimate the sales of each product.

In this project tutorial, we analyze and predict sales during Black Friday, and display the results through plots and different prediction models.

You can watch the video-based tutorial with a step-by-step explanation below.

**Dataset Information**

This dataset comprises sales transactions captured at a retail store. It's a classic dataset for exploring and expanding your feature engineering skills and your understanding of day-to-day shopping behavior. This is a regression problem. The dataset has 550,069 rows and 12 columns.

**Problem:** Predict purchase amount

**Attributes:**

Several attributes are masked, hiding the underlying data values.

*Download the dataset* __here__

**Import modules**

```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')
```

- **pandas** - used to perform data manipulation and analysis
- **numpy** - used to perform a wide variety of mathematical operations on arrays
- **matplotlib** - used for data visualization and graphical plotting
- **seaborn** - built on top of matplotlib with similar functionalities
- **%matplotlib** - to enable inline plotting
- **warnings** - to manipulate warning details
- **filterwarnings('ignore')** - to ignore the warnings thrown by the modules (gives cleaner results)

**Loading the dataset**

```
df = pd.read_csv('train.csv')
df.head()
```

Some columns have null values; those must be replaced with relevant values for further processing.

**Let us see the statistical information of the attributes**

```
# statistical info
df.describe()
```

Statistical information of the data

**Product_Category_2** and **Product_Category_3** have a lower number of samples than **Product_Category_1**; both could be subcategories.

**Let us see the data type information of the attributes**

```
# datatype info
df.info()
```

We have categorical as well as numerical attributes which we will process separately.

Product_Category_1's data type differs from Product_Category_2 and Product_Category_3, but that won't affect the process or the result.

```
# find unique values
df.apply(lambda x: len(x.unique()))
```

Attributes containing many unique values are of numerical type. The remaining attributes are of categorical type.
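As a rough heuristic, you can separate the two types programmatically from these unique-value counts (a quick sketch; the threshold of 20 is an arbitrary assumption, not from the tutorial):

```
# split columns by cardinality; the threshold of 20 is an arbitrary choice
unique_counts = df.apply(lambda x: len(x.unique()))
numeric_like = unique_counts[unique_counts > 20].index.tolist()
categorical_like = unique_counts[unique_counts <= 20].index.tolist()
print('numeric-like:', numeric_like)
print('categorical-like:', categorical_like)
```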

**Exploratory Data Analysis**

```
# distplot for purchase
plt.style.use('fivethirtyeight')
plt.figure(figsize=(13, 7))
sns.distplot(df['Purchase'], bins=25)
```

The first part of the graph follows a normal distribution, with some peaks forming later on.

Evaluating the graph as a whole, the distribution is roughly normal.
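To back up the visual impression with numbers, pandas can report the shape of the distribution directly (a small sanity-check sketch, not part of the original tutorial):

```
# quantify the shape of the Purchase distribution
print('skewness:', df['Purchase'].skew())
print('kurtosis:', df['Purchase'].kurt())
```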

```
# distribution of the categorical variables
sns.countplot(df['Gender'])
```

The majority of buyers are male, while female buyers are the minority.

The difference may stem from the categories on sale during Black Friday; evaluating a particular category could change the count between genders.
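For instance, restricting the plot to a single product category may shift the balance (a sketch; category 1 is an arbitrary pick for illustration):

```
# gender counts within one product category (category 1 chosen arbitrarily)
subset = df[df['Product_Category_1'] == 1]
sns.countplot(subset['Gender'])
```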

`sns.countplot(df['Age'])`

There are 7 categories defined to classify the age of the buyers

`sns.countplot(df['Marital_Status'])`

The majority of buyers are single

`sns.countplot(df['Occupation'])`

Distribution of the buyers' occupations

Occupation 8 has an extremely low count compared with the others; it can be ignored in the analysis since it won't affect the result much.

`sns.countplot(df['Product_Category_1'])`

The majority of products are in categories 1, 5, and 8.

The low-count categories can be combined into a single category to greatly reduce the complexity of the problem.
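One possible way to merge the rare categories into a single bucket (a sketch; the 1% cutoff and the -1 label are assumptions for illustration):

```
# merge rare Product_Category_1 values into one bucket (cutoff is illustrative)
counts = df['Product_Category_1'].value_counts(normalize=True)
rare = counts[counts < 0.01].index
df['Product_Category_1_merged'] = df['Product_Category_1'].replace(list(rare), -1)
print(df['Product_Category_1_merged'].nunique())
```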

`sns.countplot(df['Product_Category_2'])`

Categories are in float values

Categories 2, 8, 14 to 16 are higher compared with the others.

`sns.countplot(df['Product_Category_3'])`

Categories are in float values

Categories 14 to 17 are higher

`sns.countplot(df['City_Category'])`

The higher count for a city category might correspond to an urban area with a larger population.

`sns.countplot(df['Stay_In_Current_City_Years'])`

Most buyers have been living in the city for one year.

The remaining categories are roughly uniformly distributed.

**Now let us plot using two variables for analysis**

```
# bivariate analysis
occupation_plot = df.pivot_table(index='Occupation', values='Purchase', aggfunc=np.mean)
occupation_plot.plot(kind='bar', figsize=(13, 7))
plt.xlabel('Occupation')
plt.ylabel("Purchase")
plt.title("Occupation and Purchase Analysis")
plt.xticks(rotation=0)
plt.show()
```

- **np.mean** will display the mean of the purchase based on occupation
- **np.sum** will display the sum of the purchase based on occupation

Based on the labels, we can observe that all the occupation categories purchase at a similar average level.
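For example, swapping in **np.sum** shows the total rather than the average purchase per occupation (the same pivot; only the aggregation function changes):

```
# total purchase per occupation instead of the mean
occupation_sum = df.pivot_table(index='Occupation', values='Purchase', aggfunc=np.sum)
occupation_sum.plot(kind='bar', figsize=(13, 7))
plt.title('Total Purchase by Occupation')
plt.show()
```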

This style of bar plot is recommended for presentations.

```
age_plot = df.pivot_table(index='Age', values='Purchase', aggfunc=np.mean)
age_plot.plot(kind='bar', figsize=(13, 7))
plt.xlabel('Age')
plt.ylabel("Purchase")
plt.title("Age and Purchase Analysis")
plt.xticks(rotation=0)
plt.show()
```

The Age and Purchase graph also shows a fairly uniform distribution.

```
gender_plot = df.pivot_table(index='Gender', values='Purchase', aggfunc=np.mean)
gender_plot.plot(kind='bar', figsize=(13, 7))
plt.xlabel('Gender')
plt.ylabel("Purchase")
plt.title("Gender and Purchase Analysis")
plt.xticks(rotation=0)
plt.show()
```

The distribution is nearly uniform, with a small difference between the genders.

**Preprocessing the dataset**

**We must check first for null values in the data**

```
# check for null values
df.isnull().sum()
```

Null values are present in **Product_Category_2** and **Product_Category_3**. These must be filled for easier processing.

**Now we fill the Null values in the dataset**

```
df['Product_Category_2'] = df['Product_Category_2'].fillna(-2.0).astype("float32")
df['Product_Category_3'] = df['Product_Category_3'].fillna(-2.0).astype("float32")
```

The null values are filled with a negative value so they do not affect the results.

The fill value must have the same data type as the attribute.
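A quick sanity check confirms that the filled columns keep a float data type (a small sketch):

```
# confirm the filled columns keep a float dtype
print(df[['Product_Category_2', 'Product_Category_3']].dtypes)
```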

**Let us double check the null values**

`df.isnull().sum()`

**Now we must convert the categorical attributes to numerical using a dictionary**

```
# encoding values using dict
gender_dict = {'F':0, 'M':1}
df['Gender'] = df['Gender'].apply(lambda x: gender_dict[x])
df.head()
```

'F' is now converted to the numerical value 0, and likewise 'M' to 1.

**Label encoding converts categorical columns into numerical columns much more quickly**

```
# to improve the metric, use one hot encoding
# label encoding
cols = ['Age', 'City_Category', 'Stay_In_Current_City_Years']
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in cols:
    df[col] = le.fit_transform(df[col])
df.head()
```

One-hot encoding increases the number of columns but can improve accuracy.

More columns mean more data to train on, which will increase the training time.
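If you want to try the one-hot alternative mentioned in the comment above, pandas' `get_dummies` is one option (a sketch; the tutorial itself sticks with label encoding):

```
# one-hot encode the same columns instead of label encoding (alternative approach)
df_onehot = pd.get_dummies(df, columns=['Age', 'City_Category', 'Stay_In_Current_City_Years'])
print(df_onehot.shape)
```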

All categorical columns are now converted to numerical.

For the input, **User_ID** and **Product_ID** must be removed in order to generalize the results.

**Correlation Matrix**

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. The value is in the range of -1 to 1. If two variables have a high correlation, we can neglect one variable from those two.

```
corr = df.corr()
plt.figure(figsize=(14,7))
sns.heatmap(corr, annot=True, cmap='coolwarm')
```

**Purchase** is most correlated with **Product_Category_1** and **Product_Category_3**. **Marital_Status** and **Age** also have a positive correlation.
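To find such highly correlated pairs programmatically, one option is to scan the matrix against a threshold (a sketch; the 0.5 cutoff is an arbitrary assumption):

```
# list feature pairs whose absolute correlation exceeds a cutoff (0.5 is arbitrary)
threshold = 0.5
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            print(a, b, round(corr.loc[a, b], 3))
```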

**Input Split**

`df.head()`

**User_ID** and **Product_ID** must be removed for better results; otherwise the results will be biased toward **User_ID** or **Product_ID**.

**Now we split the data for training**

```
X = df.drop(columns=['User_ID', 'Product_ID', 'Purchase'])
y = df['Purchase']
```

Purchase is the output (target) column, which is why it is removed from X as well.
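A quick shape check confirms the split (a small sanity-check sketch):

```
# X holds the input features, y the Purchase target
print(X.shape, y.shape)
```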

**Model Training**

```
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error

def train(model, X, y):
    # train-test split
    x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.25)
    model.fit(x_train, y_train)
    # predict the results
    pred = model.predict(x_test)
    # cross validation
    cv_score = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=5)
    cv_score = np.abs(np.mean(cv_score))
    print("Results")
    # note: taking np.sqrt of the mean squared error reports the RMSE
    print("MSE:", np.sqrt(mean_squared_error(y_test, pred)))
    print("CV Score:", np.sqrt(cv_score))
```

- **cross_val_score()** is used for better validation of the model
- **cv=5** means the cross-validation will split the data into 5 folds for training
- **np.abs()** converts the negative score to positive, and **np.mean()** gives the average of the 5 scores

**Now we train the baseline models**

```
from sklearn.linear_model import LinearRegression
model = LinearRegression(normalize=True)
train(model, X, y)
coef = pd.Series(model.coef_, X.columns).sort_values()
coef.plot(kind='bar', title='Model Coefficients')
```

```
Results
MSE: 4617.994034201719
CV Score: 4625.252945835687
```

The Linear Regression model needs normalized data to give better results.
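Note that recent scikit-learn versions removed the `normalize` parameter from `LinearRegression`; a `Pipeline` with `StandardScaler` is one roughly equivalent alternative (a sketch, not the tutorial's original code):

```
# scale the features explicitly instead of relying on the removed normalize= parameter
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_model = make_pipeline(StandardScaler(), LinearRegression())
train(scaled_model, X, y)
```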

The **Gender** category has a high coefficient in the Linear Regression model.

```
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
train(model, X, y)
features = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
features.plot(kind='bar', title='Feature Importance')
```

```
Results
MSE: 3366.9672356860747
CV Score: 3338.5905886644855
```

The results have improved compared to the Linear Regression model.

**Product_Category_1** has a high feature importance compared to the Linear Regression model.

```
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_jobs=-1)
train(model, X, y)
features = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
features.plot(kind='bar', title='Feature Importance')
```

```
Results
MSE: 3062.66041010778
CV Score: 3052.7778119222253
```

Better results compared with the Decision Tree Regressor.

```
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor(n_jobs=-1)
train(model, X, y)
features = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
features.plot(kind='bar', title='Feature Importance')
```