Hackers Realm

# Mall Customer Segmentation Analysis using Python | Clustering | Machine Learning Project Tutorial

The Mall Customer Segmentation Analysis is a clustering problem that falls under unsupervised learning. The dataset describes mall customers and their spending scores; the goal is to divide the customers into groups so that the marketing team can plan its strategy for each group accordingly.

In this project tutorial, we will explore Mall Customer Segmentation Analysis using Python. We will also discuss unsupervised learning, principal component analysis, k-means clustering, and the elbow method.

You can watch the video-based tutorial with a step-by-step explanation below.

**Dataset Information**

You own a supermarket mall, and through membership cards you have some basic data about your customers. Spending Score is a value you assign to each customer based on parameters you define, such as customer behavior and purchasing data.

**Attributes**

Customer ID

Age

Gender

Annual income

Spending score

*Download the dataset* __here__

**Import Modules**

```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')
```

**pandas** - used to perform data manipulation and analysis

**numpy** - used to perform a wide variety of mathematical operations on arrays

**matplotlib** - used for data visualization and graphical plotting

**seaborn** - built on top of matplotlib with similar functionalities

**%matplotlib inline** - to enable inline plotting

**warnings** - to control how warnings are handled

`warnings.filterwarnings('ignore')` suppresses the warnings raised by the modules, which keeps the output clean

**Load the Dataset**

```
df = pd.read_csv('/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
df.head()
```

`df.head()` shows the first 5 samples of the dataset.

CustomerID is only an identifier and is not needed for clustering, so it can be dropped.
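The tutorial does not show the drop itself, so here is a minimal sketch of one way to do it. The `sample` frame is a hypothetical two-row stand-in with the same columns as the CSV, and `features` is a name introduced only for this example.

```python
import pandas as pd

# Hypothetical two-row stand-in with the same columns as Mall_Customers.csv.
sample = pd.DataFrame({
    'CustomerID': [1, 2],
    'Gender': ['Male', 'Female'],
    'Age': [19, 21],
    'Annual Income (k$)': [15, 15],
    'Spending Score (1-100)': [39, 81],
})

# Drop the identifier column; errors='ignore' keeps the call safe to re-run
# in a notebook after the column is already gone.
features = sample.drop(columns=['CustomerID'], errors='ignore')

print(features.columns.tolist())
```

The clustering steps later select their columns by name, so they work whether or not this drop is applied.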

```
# statistical info
df.describe()
```

`describe()` reports summary statistics for every numerical column, including the count, mean, standard deviation, minimum, quartiles, and maximum.

```
# datatype info
df.info()
```

Only one attribute (Gender) is categorical; the rest are numerical.

There are no NULL values in the data, so no further preprocessing is necessary.

If any NULL values were present in the dataset, they would have to be replaced with a value, or the affected rows dropped.
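As a minimal sketch of those two options, the snippet below uses a small hypothetical frame (`toy`, not the mall data, which has no missing values) to count, fill, and drop missing values with pandas. Filling with the column median is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; the mall dataset itself has no NULL values.
toy = pd.DataFrame({
    'Age': [25, np.nan, 40, 31],
    'Annual Income (k$)': [15.0, 54.0, np.nan, 78.0],
})

# Count the missing values in every column.
null_counts = toy.isnull().sum()

# Option 1: replace each missing value with the median of its column.
filled = toy.fillna(toy.median())

# Option 2: drop every row that contains at least one missing value.
dropped = toy.dropna()

print(null_counts)
print(filled)
print(dropped)
```

Filling keeps all rows at the cost of inventing values, while dropping keeps only observed values at the cost of losing rows.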

**Exploratory Data Analysis**

`sns.countplot(x='Gender', data=df)`

The distribution is almost equal, with female customers in a slight majority.

`sns.distplot(df['Age'])`

The ages are well spread, with most customers between 30 and 40 years old.

`sns.distplot(df['Annual Income (k$)'])`

Annual income is also well distributed.

`sns.distplot(df['Spending Score (1-100)'])`

Most spending scores fall between 40 and 60.

**Correlation Matrix**

A correlation matrix is a table showing correlation coefficients between variables.
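To make the coefficient concrete, here is a small sketch of the Pearson correlation, the default used by `DataFrame.corr()`, computed by hand and checked against NumPy. The `income` and `score` values and the `pearson` helper are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula.
income = np.array([15.0, 30.0, 45.0, 60.0, 75.0])
score = np.array([80.0, 60.0, 50.0, 45.0, 20.0])


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator


r_manual = pearson(income, score)
r_numpy = np.corrcoef(income, score)[0, 1]

print(r_manual, r_numpy)
```

The coefficient always lies between -1 and 1; here it is negative because the score falls as the income rises.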

```
# correlate only the numerical columns (Gender holds strings)
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
```

With the `coolwarm` colormap, red cells show a positive correlation and blue cells show a negative correlation.

In supervised learning, we can drop highly correlated attributes.

Since this is unsupervised learning, we will reduce the dimension of the dataset using principal component analysis.
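The tutorial names principal component analysis here but does not show the step. A minimal sketch of what it could look like with scikit-learn follows. Standardizing the columns first is an assumption of this sketch, not something stated in the source, and the `demo` frame holds synthetic random values rather than the real data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three numerical columns (random values, not the real data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.integers(18, 70, size=50),
    'Annual Income (k$)': rng.integers(15, 140, size=50),
    'Spending Score (1-100)': rng.integers(1, 100, size=50),
})

# Assumption of this sketch: standardize first so no column dominates by scale.
scaled = StandardScaler().fit_transform(demo)

# Project the 3 standardized features onto the 2 directions of largest variance.
pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)

print(reduced.shape)
print(pca.explained_variance_ratio_)  # share of the variance kept by each component
```

The two columns of `reduced` could then be clustered or plotted in place of the original features.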

**Clustering**

`df.head()`

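```
# cluster on 2 features
# .copy() gives an independent frame, so a label column can be added later without a pandas warning
df1 = df[['Annual Income (k$)', 'Spending Score (1-100)']].copy()
df1.head()
```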

First, let us take only two attributes for clustering.

```
# scatter plot (x and y must be passed as keywords in recent seaborn versions)
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=df1)
```

Scatter plot of Annual income and Spending Score

Most of the points sit in the center, which can form one cluster. The four corners can form four more clusters, or they can be merged into two.

**Now we can start clustering the data**

```
from sklearn.cluster import KMeans
errors = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(df1)
    errors.append(kmeans.inertia_)
```

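For each number of clusters from 1 to 10, the `errors` list contains the sum of squared distances of the samples to their closest cluster center, also called the within-cluster sum of squares (WCSS).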
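To make that quantity concrete, the sketch below recomputes it by hand for a fitted model and compares it with `inertia_`. The six `points` are hypothetical, and `n_init` and `random_state` are set only to make the example reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two obvious groups.
points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Squared distance from every point to every cluster center, shape (6, 2).
sq_dist = ((points[:, None, :] - model.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# WCSS: keep each point's distance to its closest center, then add them up.
wcss_manual = sq_dist.min(axis=1).sum()

print(wcss_manual, model.inertia_)
```

Both numbers agree, which confirms that `inertia_` is the sum of squared distances to the nearest center.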

```
# plot the results for elbow method
plt.figure(figsize=(13,6))
plt.plot(range(1,11), errors, linewidth=3, color='red', marker='8')
plt.xlabel('No. of clusters')
plt.ylabel('WCSS')
plt.xticks(np.arange(1,11,1))
plt.show()
```

We use the elbow method to find the number of clusters.

The curve in the graph is shaped like a bent arm.

We take the number of clusters at the joint of the elbow, where the WCSS stops dropping sharply.

The best number of clusters appears to be 5.
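Reading the elbow off a plot is a judgment call. One rough heuristic, which is not part of the original tutorial, is to pick the point that lies furthest from the straight line joining the first and last points of the curve. The sketch below applies it to made-up WCSS values, not to values measured on the mall data.

```python
import numpy as np

# Illustrative WCSS values for k = 1..10 (made up for this sketch).
ks = np.arange(1, 11)
wcss = np.array([1000.0, 700.0, 450.0, 300.0, 150.0,
                 130.0, 115.0, 105.0, 98.0, 92.0])

# Rescale both axes to [0, 1] so that neither dominates the distance.
x = (ks - ks[0]) / (ks[-1] - ks[0])
y = (wcss - wcss[-1]) / (wcss[0] - wcss[-1])

# After rescaling, the curve runs from (0, 1) to (1, 0), i.e. along x + y = 1.
# The distance of a point to that line is |x + y - 1| / sqrt(2).
distance = np.abs(x + y - 1) / np.sqrt(2)

elbow_k = int(ks[np.argmax(distance)])
print(elbow_k)
```

A heuristic like this is best treated as a cross-check on the plot rather than a replacement for looking at it.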

```
km = KMeans(n_clusters=5)
km.fit(df1)
y = km.predict(df1)
df1['Label'] = y
df1.head()
```

Added cluster label for each sample

`sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=df1, hue='Label', s=50, palette=['red', 'green', 'brown', 'blue', 'orange'])`

Scatter plot graph of the clustered data

Depending on the analysis of the data, you can send specific offers to the customers in each cluster.
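One simple way to analyze the clusters is to summarize each one by its average income and spending score. The sketch below does this on a hypothetical labeled frame (`labeled`, with invented values) laid out like `df1`; the output column names are introduced only for this example.

```python
import pandas as pd

# Hypothetical labeled rows in the same layout as df1 (values invented for illustration).
labeled = pd.DataFrame({
    'Annual Income (k$)': [15, 18, 80, 85, 50, 55],
    'Spending Score (1-100)': [80, 75, 15, 10, 50, 45],
    'Label': [0, 0, 1, 1, 2, 2],
})

# Average income and spending score per cluster, plus the cluster size.
profile = labeled.groupby('Label').agg(
    mean_income=('Annual Income (k$)', 'mean'),
    mean_score=('Spending Score (1-100)', 'mean'),
    customers=('Annual Income (k$)', 'size'),
)

print(profile)
```

A cluster with low income and a high score, for example, might respond to different offers than one with high income and a low score.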

**Now let us use three-dimensional data**

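```
# cluster on 3 features
# .copy() gives an independent frame, so a label column can be added later without a pandas warning
df2 = df[['Annual Income (k$)', 'Spending Score (1-100)', 'Age']].copy()
df2.head()
```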

```
errors = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(df2)
    errors.append(kmeans.inertia_)
```

```
# plot the results for elbow method
plt.figure(figsize=(13,6))
plt.plot(range(1,11), errors, linewidth=3, color='red', marker='8')
plt.xlabel('No. of clusters')
plt.ylabel('WCSS')
plt.xticks(np.arange(1,11,1))
plt.show()
```

The optimal number of clusters is still 5.

```
km = KMeans(n_clusters=5)
km.fit(df2)
y = km.predict(df2)
df2['Label'] = y
df2.head()
```

Added cluster label for each sample in new data

```
# 3d scatter plot
fig = plt.figure(figsize=(20,15))
ax = fig.add_subplot(111, projection='3d')
# one ax.scatter() call per cluster: filter the rows by label and give each cluster its own color
colors = ['red', 'green', 'blue', 'brown', 'orange']
for label, color in enumerate(colors):
    cluster = df2[df2['Label'] == label]
    ax.scatter(cluster['Age'], cluster['Annual Income (k$)'], cluster['Spending Score (1-100)'], c=color, s=50)
ax.view_init(30, 190)
ax.set_xlabel('Age')
ax.set_ylabel('Annual Income')
ax.set_zlabel('Spending Score')
plt.show()
```

3D scatter plot graph of the data

**ax.scatter()** - plots the data points of each cluster, selected by filtering on the label, in the color specified for that cluster

You may change the **view_init()** parameters to view the scatter plot from a different angle

You may use a different plot method for a different view.