The correlation matrix measures the linear relationship between pairs of features in a dataset. It indicates how strongly, and in which direction, two features are related. Each correlation value ranges from -1 to 1: -1 indicates a perfect negative linear relationship, 0 indicates no linear relationship, and 1 indicates a perfect positive linear relationship.
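To make the -1 to 1 range concrete, here is a minimal sketch of the Pearson coefficient written in plain Python. The helper name `pearson_r` and the sample lists are illustrative assumptions, not part of the original project.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and the two sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # y rises in step with x
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # y falls as x rises
print(r_pos, r_neg)
```

The first pair moves together perfectly and the second moves in exactly opposite directions, so the two results sit at the two ends of the range.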

Correlation-based feature selection is best suited to problems where the relationships between features are expected to be linear. If you suspect non-linear relationships, you may need other methods, such as model-based feature importance or feature engineering techniques.
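The limitation with non-linear relationships can be sketched with a small made-up example. The DataFrame `demo` and its column names are assumptions for illustration only: `quadratic` depends entirely on `x`, yet its linear correlation with `x` is essentially zero.

```python
import numpy as np
import pandas as pd

# Made-up data: x is symmetric around zero
x = np.linspace(-3, 3, 61)
demo = pd.DataFrame({
    "x": x,
    "linear": 2 * x + 1,     # straight-line dependence on x
    "quadratic": x ** 2,     # fully determined by x, but not linearly
})

# Correlation of every column with x
r = demo.corr()["x"]
print(r.round(3))
```

A correlation matrix alone would suggest that `quadratic` is unrelated to `x`, which is why a near-zero value should not be read as proof of no relationship.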


## Load the Dataset

First, we load the data.

```
import pandas as pd
# read the CSV file into a DataFrame and preview the first rows
df = pd.read_csv('data/bike sharing dataset.csv')
df.head()
```

The read_csv() function reads the CSV file 'bike sharing dataset.csv' from the 'data' directory into the DataFrame df.

Calling head() on df displays the first five rows of the DataFrame.

## Finding Correlation Matrix

Next, we create the correlation matrix of the dataset.

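```
# numeric_only=True skips non-numeric columns (needed on pandas 2.0+ if any are present)
corr = df.corr(numeric_only=True)
corr
```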

This snippet calculates the correlation matrix for the features in df and stores it in the variable corr. Displaying corr shows the pairwise correlations between all the numeric features in the dataset.

## Display Correlation Matrix

Next, we display the correlation matrix as a heatmap, which makes it much easier to analyze.

```
import matplotlib.pyplot as plt
import seaborn as sns

# display correlation matrix in heatmap
corr = df.corr(numeric_only=True)
plt.figure(figsize=(14,9))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
```

First, we calculate the correlation matrix.

We set the figure size with plt.figure(figsize=(14, 9)) to make the heatmap larger and easier to read.

Then we call sns.heatmap() to create the heatmap, passing the correlation matrix corr as the data. The annot=True argument writes the correlation value in each cell, and cmap='coolwarm' sets the color map.

Finally, we call plt.show() to display the heatmap.

cnt is the target variable in this dataset.

From the heatmap, we can infer that the casual and registered attributes are highly correlated with the target variable.

Attributes that are highly correlated with the target are treated as important, because they help us predict the target variable.

Any attribute whose correlation with the target is above +0.05 or below -0.05 has some importance for predicting it. Here you can see that the temperature attribute has a positive correlation of 0.4.

The hour attribute is also useful: based on it, you can predict how many bikes will be rented in a particular hour.

You can also infer that the attributes holiday, weekday and workingday are not very important, as their correlations with the target fall between -0.05 and +0.05.
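The 0.05 cutoff described above can be applied programmatically. The sketch below uses a tiny made-up DataFrame named `toy`. Its column names mirror the article, but the values are invented, so the resulting correlations do not match the real dataset.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the article, values are made up
toy = pd.DataFrame({
    "temp":    [1, 2, 3, 4, 5, 6, 7, 8],
    "hum":     [90, 80, 85, 70, 60, 65, 50, 40],
    "holiday": [0, 1, 1, 0, 0, 1, 1, 0],
    "cnt":     [12, 18, 33, 41, 48, 62, 69, 81],
})

target = "cnt"
threshold = 0.05  # cutoff used in the article

# Absolute correlation of each feature with the target, strongest first
target_corr = (
    toy.corr()[target]
    .drop(target)
    .abs()
    .sort_values(ascending=False)
)

selected = target_corr[target_corr > threshold].index.tolist()
dropped = target_corr[target_corr <= threshold].index.tolist()
print(selected, dropped)
```

Taking the absolute value first means strongly negative features are kept alongside strongly positive ones. On the real data you would replace `toy` with `df`.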

To decide which input features to eliminate, look at the correlations between the features themselves. Here you can clearly see that atemp and temp have a correlation of 0.99, which is very high. If two features have a correlation above 0.7, you can drop either one of them, as both represent a similar pattern.

You can also see that yr (year) is highly correlated with instant. The instant attribute contains serial numbers, which are of little importance, so we can drop instant. Similarly, mnth (month) is highly correlated with season, so you can drop either one of them.
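Dropping one feature from each highly correlated pair can also be automated. The following is a minimal sketch on made-up values. The DataFrame `features` is hypothetical, and the rule of always keeping the earlier column of a pair is a simplifying assumption rather than the article's method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features: values are made up for illustration
features = pd.DataFrame({
    "temp":  [0.20, 0.35, 0.50, 0.65, 0.80, 0.40],
    "atemp": [0.22, 0.36, 0.49, 0.63, 0.78, 0.41],
    "hum":   [0.70, 0.40, 0.80, 0.50, 0.60, 0.55],
})

threshold = 0.7  # cutoff used in the article
abs_corr = features.corr().abs()

# Keep only the upper triangle so each pair is examined once
mask = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
upper = abs_corr.where(mask)

# Drop a column if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Which member of a pair to keep is a judgment call. In the article, instant is dropped because it only holds serial numbers, so domain knowledge should override the column order where it applies.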

Study the correlation matrix carefully and you can infer much more from it.

## Final Thoughts

A correlation matrix allows you to quickly identify highly correlated features, which helps in spotting redundant or overlapping information.

By removing highly correlated features, you can reduce dimensionality, improve model interpretability, and potentially enhance model performance by reducing noise and overfitting.

Correlation matrix-based feature selection considers only pairwise relationships, so it may not account for the combined influence of multiple features on the target variable.

The Pearson correlation computed by df.corr() by default measures only linear relationships, and it is sensitive to outliers and heavily skewed distributions. If the relationship is not linear or the data is far from normally distributed, the correlation results may not be accurate or meaningful.
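One way to check whether linearity matters for a given pair is to compare Pearson with a rank-based coefficient. The sketch below uses the `method="spearman"` option of `DataFrame.corr()`, which the article itself does not use. The DataFrame `mono` is made-up data.

```python
import numpy as np
import pandas as pd

# Made-up data: y always increases with x, but along a steep curve
x = np.arange(1, 21)
mono = pd.DataFrame({"x": x, "y": np.exp(x / 2.0)})

pearson = mono.corr(method="pearson").loc["x", "y"]
spearman = mono.corr(method="spearman").loc["x", "y"]
print(round(pearson, 3), round(spearman, 3))
```

Spearman works on ranks, so it reports a perfect monotonic relationship here, while Pearson comes out noticeably lower because the points do not lie on a straight line. A large gap between the two is a hint that the linear coefficient is understating the relationship.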

Features with a low correlation may still be important if they have non-linear or complex relationships with the target variable, which correlation analysis alone does not capture.

It is important to consider domain knowledge, as well as the performance of the selected features in a chosen model, to ensure the most relevant and informative features are selected.

In summary, correlation matrix-based feature selection is a valuable technique for identifying and removing highly correlated features. However, it should be used as part of a broader feature selection strategy that considers other methods and domain knowledge, to ensure a comprehensive and accurate selection of features for your specific machine learning problem.

In this article, we explored how to perform feature selection using a correlation matrix. In future articles, we will explore other methods of feature selection.


Thanks for reading the article!!!

Check out more project videos from the YouTube channel Hackers Realm