Hackers Realm

Feature Selection using Correlation Matrix (Numerical) | Machine Learning | Python

The correlation matrix measures the linear relationship between every pair of features in a dataset, indicating how strongly and in what direction two features are related. A correlation value ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive correlation.

Additionally, correlation-based feature selection is best suited for problems where the relationships between features are expected to be linear. If you suspect non-linear relationships, you may need to explore other methods, such as model-based feature importance or feature engineering techniques.
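To see why this matters, here is a minimal sketch (using NumPy, with illustrative data) of a perfect but non-linear relationship that Pearson correlation completely misses:

```python
import numpy as np

# x and y are perfectly related (y = x**2), but the relationship
# is non-linear, so the Pearson correlation is close to zero
x = np.linspace(-1, 1, 101)
y = x ** 2
r = np.corrcoef(x, y)[0, 1]
print(r)
```

A correlation-based filter would wrongly discard y as unrelated to x, even though x determines y exactly.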

You can watch the video tutorial with a step-by-step explanation below.

First, we will load the data:

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('data/bike sharing dataset.csv')
df.head()
```
• We will read the CSV file 'bike sharing dataset.csv' located in the 'data' directory and assign it to the DataFrame df using the read_csv() function

• The head() method is called on the DataFrame df to display its first few rows

Finding Correlation Matrix

Next, we will create a correlation matrix of the dataset:

```
corr = df.corr()
corr
```
• In the code snippet above, we calculate the correlation matrix for the numeric features in the DataFrame df and store it in the variable corr. Printing corr shows the pairwise correlations between all the features in the dataset. (On newer versions of pandas, you may need df.corr(numeric_only=True) if the DataFrame contains non-numeric columns.)
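To turn the matrix into a ranking of candidate features, you can sort each feature's absolute correlation with the target. A minimal sketch on a toy DataFrame (the column names and values here are illustrative stand-ins for the bike sharing data):

```python
import pandas as pd

# toy stand-in for the bike sharing data (illustrative values)
df = pd.DataFrame({
    'temp':    [0.2, 0.4, 0.6, 0.8],
    'holiday': [0, 1, 0, 0],
    'cnt':     [100, 210, 290, 400],
})
corr = df.corr()
# absolute correlation of every feature with the target 'cnt', strongest first
ranking = corr['cnt'].drop('cnt').abs().sort_values(ascending=False)
print(ranking)
```

Taking the absolute value matters: a strong negative correlation is just as useful for prediction as a strong positive one.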

Display Correlation Matrix

Next, we will display the correlation matrix as a heatmap, which makes it much easier to analyze:

```
# display correlation matrix in heatmap
corr = df.corr()
plt.figure(figsize=(14, 9))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
```
• First we will calculate the correlation matrix

• We will set the figure size using plt.figure(figsize=(14, 9)) to make the heatmap larger and easier to read

• Then, we use sns.heatmap() to create the heatmap, passing the correlation matrix corr as the data. The annot=True argument adds the correlation values to the heatmap cells. The cmap='coolwarm' argument sets the color map for the heatmap

• Finally, we use plt.show() to display the heatmap

• cnt (the total rental count) is the target variable in this dataset

• From the heatmap, we can infer that the casual and registered attributes have a high correlation with the target variable

• Attributes that are highly correlated with the target are treated as important, since they help the model predict the target variable more easily

• Any attribute whose correlation with the target is above +0.05 or below -0.05 carries some importance. Here you can see that the temperature attribute has a positive correlation of 0.4 with the target

• Based on the hour attribute, you can also predict how many bikes will be rented in a particular hour

• You can also infer that the attributes holiday, weekday, and workingday are not very important, as their correlation values lie between -0.05 and +0.05

• To eliminate some of the input features, you should examine the complete matrix. Here you can clearly see that the attributes atemp and temp have a correlation of 0.99, which is a very high positive value. If two features have a correlation above 0.7, you can drop either one of them, as both represent a similar pattern

• You can also see that yr (year) is highly correlated with instant. The instant attribute contains serial numbers, which carry little information, so we can drop instant. Similarly, mnth (month) is highly correlated with season, so you can drop either one of them

• If you study the correlation matrix carefully, you can infer much more information from it
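The drop rule described above (remove one feature from each pair with correlation above 0.7) can be automated. A minimal sketch, assuming pandas and NumPy; the helper name, threshold default, and toy column values are illustrative:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.7):
    """Drop one feature from each pair whose absolute correlation
    exceeds threshold. Keeps the first column of each such pair."""
    corr = df.corr().abs()
    # upper-triangle mask (k=1 excludes the diagonal),
    # so each pair is inspected only once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# toy example: 'atemp' closely mirrors 'temp', so one of them is dropped
df = pd.DataFrame({
    'temp':    [0.10, 0.20, 0.30, 0.40, 0.50],
    'atemp':   [0.11, 0.21, 0.29, 0.41, 0.52],
    'holiday': [0, 1, 0, 1, 0],
})
print(drop_highly_correlated(df).columns.tolist())
```

In practice you would run this only on the input features (excluding the target), so that a feature is never dropped merely for being strongly correlated with the target itself.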

Final Thoughts

• A correlation matrix allows you to quickly identify highly correlated features, which helps in spotting redundant or overlapping information

• By removing highly correlated features, you can reduce dimensionality, improve model interpretability, and potentially enhance model performance by reducing noise and overfitting

• Correlation matrix-based feature selection considers pairwise relationships, but it may not account for the combined influence of multiple features on the target variable

• Correlation analysis assumes that the relationship between variables is linear (and, for significance testing, approximately normally distributed). If these assumptions are violated, the correlation results may not be accurate or meaningful

• Correlated features may still be important if they have non-linear or complex relationships with the target variable, which are not captured by correlation analysis alone

• It is important to consider domain knowledge, as well as the performance of the selected features in a chosen model, to ensure the most relevant and informative features are selected

• In summary, correlation matrix-based feature selection is a valuable technique to identify and remove highly correlated features. However, it should be used as part of a broader feature selection strategy, considering other methods and domain knowledge, to ensure a comprehensive and accurate selection of features for your specific machine learning problem
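As the points above note, Pearson correlation assumes linearity. When a relationship is monotonic but non-linear, Spearman rank correlation, which pandas supports via corr(method='spearman'), can serve as a drop-in alternative. A small sketch with illustrative values:

```python
import pandas as pd

# monotonic but non-linear relationship: y = x**3
df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df['y'] = df['x'] ** 3

pearson = df.corr(method='pearson').loc['x', 'y']
spearman = df.corr(method='spearman').loc['x', 'y']
# Spearman scores the monotonic relationship as a perfect 1.0,
# while Pearson falls below 1 because the relationship is not linear
print(pearson, spearman)
```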

In this article, we explored how to perform feature selection using a correlation matrix. In future articles, we will explore other feature selection methods.

Get the project notebook from here